Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow broadcast #560

Closed
dkarrasch opened this issue Dec 6, 2018 · 15 comments · Fixed by #1079
Closed

Very slow broadcast #560

dkarrasch opened this issue Dec 6, 2018 · 15 comments · Fixed by #1079
Labels
performance runtime performance

Comments

@dkarrasch
Copy link
Contributor

At SciML/OrdinaryDiffEq.jl#571 an instance of very slow broadcast was identified:

using BenchmarkTools, StaticArrays

f(ũ, u₀, u₁, ρ) = ũ ./ (1e-6 .+ max.(abs.(u₀), abs.(u₁)) .* ρ)
ũ, u₀, u₁, ρ = rand(3), rand(3), rand(3), rand(3)
@btime f($ũ, $u₀, $u₁, $ρ) # 47.473 ns (1 allocation: 112 bytes)

ũ, u₀, u₁, ρ = rand(SVector{3}), rand(SVector{3}), rand(SVector{3}), rand(SVector{3})
@btime f($ũ, $u₀, $u₁, $ρ) # 2.354 μs (44 allocations: 3.59 KiB)
@KristofferC
Copy link
Contributor

On Julia master

julia> @btime f($ũ, $u₀, $u₁, $ρ)
  89.981 ns (12 allocations: 192 bytes)

@dkarrasch
Copy link
Contributor Author

I'm sorry, yes, I forgot to mention/copy that we also identified that this seems to be a Julia v1.0.2 issue.

@tkoolen
Copy link
Contributor

tkoolen commented Dec 6, 2018

I wonder if this is because of #539. Perhaps the use of max instead of _max is only inferable on master? It wouldn't be hard to switch between the two based on VERSION if that's the case.

@dkarrasch
Copy link
Contributor Author

Is v.0.10.1 already tagged? #539 seems to be only in the latest version. I did my tests with v0.10.0.

@tkoolen
Copy link
Contributor

tkoolen commented Dec 6, 2018

No, 91743c9 is in 0.10.0 as well.

@tkoolen
Copy link
Contributor

tkoolen commented Dec 6, 2018

The benchmark doesn't run that code though, so that's not it.

@dkarrasch
Copy link
Contributor Author

No, it's slow on v0.9 and on v0.8.3... 🤔

@dkarrasch
Copy link
Contributor Author

I don't think I can contribute concretely to the resolution of the problem, but let me share the results of some experiments. Hopefully somebody finds them helpful.

# preparation
using BenchmarkTools, StaticArrays
ũ, u₀, u₁, ρ = rand(3), rand(3), rand(3), rand(3)
u, u0, u1, r = convert.(SVector{3}, (ũ, u₀, u₁, ρ))
# function of interest
Fdot(ũ, u₀, u₁, ρ) = ũ ./ (max.(abs.(u₀), abs.(u₁)) .* ρ)

The following parts of Fdot are fast:

f(ũ, u₀, u₁, ρ) = ũ .* ρ
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

f(ũ, u₀, u₁, ρ) = ũ ./ ρ
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

f(ũ, u₀, u₁, ρ) = (max.(abs.(u₀), abs.(u₁)))
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

f(ũ, u₀, u₁, ρ) = ũ ./ (max.(abs.(u₀), abs.(u₁)))
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

Here is a first workaround, which hits the normal Vector case.

f(ũ, ρ) = ũ ./ ρ
g(ũ, u₀, u₁) = ũ ./ (max.(abs.(u₀), abs.(u₁)))
@btime g(f($ũ, $ρ), $u₀, $u₁)
@btime g(f($u, $r), $u0, $u1)

The following parts/versions of Fdot are slow:

f(ũ, u₀, u₁, ρ) = (max.(abs.(u₀), abs.(u₁)) .* ρ)
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

f(ũ, u₀, u₁, ρ) = ũ ./ max.(abs.(u₀), abs.(u₁)) ./ ρ
@btime f($ũ, $u₀, $u₁, $ρ)
@btime f($u, $u0, $u1, $r)

A second workaround is to define the function elementwise (without broadcasting), and then apply it via broadcasting:

F(ũ, u₀, u₁, ρ) = ũ / (max(abs(u₀), abs(u₁)) * ρ)
@btime F.($ũ, $u₀, $u₁, $ρ)
@btime F.($u, $u0, $u1, $r)

This hits the Vector case even harder.

@YingboMa
Copy link
Contributor

YingboMa commented Dec 8, 2018

Julia 1.0.2:

│ ─ %-1 = invoke Base.Broadcast.make_makeargs(::Function,::Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}},Float64}}})::Type{getfield(Base.Broadcast, Symbol("##7#8")){Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}},Float64}},getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}},getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}},_1}} where _1
Body::getfield(Base.Broadcast, Symbol("##7#8")){Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}},Float64}},getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}},getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}},_1} where _1
334 1%1  = (Base.getfield)(t, 1, true)::Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}},Float64}}
336%2  = (Base.getfield)(%1, :args)::Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}},Float64}
    │   %3  = (Base.getfield)(%2, 1, true)::Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(max),Tuple{Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}},Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(abs),Tuple{SArray{Tuple{3},Float64,1,3}}}}}
    │   %4  = Base.Broadcast.:(##5#6)::Type{getfield(Base.Broadcast, Symbol("##5#6"))}                                                 ││╻    make_makeargs%5  = (Core.typeof)(makeargs)::Type{#s55} where #s55<:Function                                                                 │││
...

I think it is fixed by JuliaLang/julia@67b06af

@dkarrasch
Copy link
Contributor Author

dkarrasch commented Dec 9, 2018

I don't think so. That patch is included in Julia v1.0.2, in which the problem occurs. Note also that the runtime on Julia master is far from optimal. My second workaround takes 5 ns, as opposed to the 90 ns reported by @KristofferC. Anyway, the 90 ns are slower than the usual Vector case.

In some sense, the broadcast needed here is of a "trivial" kind. All broadcasted operations should be fused (not sure that's the correct terminology), and the result should be the same as my second workaround, which first fuses manually, and then calls broadcasting.

@dkarrasch
Copy link
Contributor Author

dkarrasch commented Jan 7, 2019

I think this issue can be considered as (magically) sort of resolved. For the original problem on

Julia Version 1.1.0-rc1.0
Commit ba87aa3962 (2018-12-31 23:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

I now get

using BenchmarkTools, StaticArrays

f(ũ, u₀, u₁, ρ) = ũ ./ (1e-6 .+ max.(abs.(u₀), abs.(u₁)) .* ρ)
ũ, u₀, u₁, ρ = rand(3), rand(3), rand(3), rand(3)
@btime f($ũ, $u₀, $u₁, $ρ) # 48.112 ns (1 allocation: 112 bytes)

u, u0, u1, r = convert.(SVector{3}, (ũ, u₀, u₁, ρ))
@btime f($u, $u0, $u1, $r) # 11.436 ns (0 allocations: 0 bytes)

That's, however, still much more than my "second workaround" from above, which yields the same numerical result:

f(ũ, u₀, u₁, ρ) = ũ / (1e-6 + max(abs(u₀), abs(u₁)) * ρ)
@btime f.($u, $u0, $u1, $r) #  0.027 ns (0 allocations: 0 bytes)
f(ũ, u₀, u₁, ρ) = @. ũ / (1e-6 + max(abs(u₀), abs(u₁)) * ρ)
@btime f($u, $u0, $u1, $r) # 11.439 ns (0 allocations: 0 bytes)

Edit: Note that this is also much better than the previous master-runtime, see @KristofferC's #560 (comment).

@tkoolen
Copy link
Contributor

tkoolen commented Jan 7, 2019

@btime f.($u, $u0, $u1, $r) # 0.027 ns (0 allocations: 0 bytes)

0.027 ns is less than a clock cycle, so everything has been constant-folded, (i.e. the computation happened at compile time because the compiler recognized that pure functions were being called on constant inputs). A better microbenchmark would be

julia> @btime f.(u, u0, u1, r) setup = begin
           ũ, u₀, u₁, ρ = rand(3), rand(3), rand(3), rand(3)
           u, u0, u1, r = convert.(SVector{3}, (ũ, u₀, u₁, ρ))
       end
  5.413 ns (0 allocations: 0 bytes)

Still a difference though.

@dkarrasch
Copy link
Contributor Author

Yes, I was surprised about that computation time, but didn't know what to do about it. Anyway, there seems to be room for improvement, but 11 ns are still much better than 2µs with which we started. I'll leave the issue open as a reminder to revisit the issue once in a while, since we haven't really understood the original issue and how it got resolved, AFAICT.

@c42f
Copy link
Member

c42f commented Jul 31, 2019

Seems this might still be an issue, cf. SciML/DifferentialEquations.jl#436

Needs detailed investigation to figure out if we can do something about it or whether we need a fix in Base.

@mateuszbaran
Copy link
Collaborator

Small update: I've proposed a solution here: JuliaLang/julia#41090 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance runtime performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants