
Julia v0.5.0 dictionary iteration is >10x slower than v0.4.6 #18331

Closed
chipkent opened this issue Sep 2, 2016 · 18 comments
Labels: needs docs, performance, potential benchmark

@chipkent commented Sep 2, 2016

After porting my code to Julia v0.5.0, I am seeing a 3x reduction in speed in a critical area. I have narrowed the root cause down to what appears to be slow dictionary iteration in v0.5.0.

Reproducer code:
https://gist.github.com/chipkent/c8e3acc4f64bd78db4fe40f2c42072f2
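
For reference, the hot loop looks roughly like this (paraphrased; the signature is quoted later in this thread, and the key type and setup here are purely illustrative, v0.5 syntax):

using Distributions

# The value type mixes an isbits Float64 with an abstract Distribution type,
# so each value is a non-isbits tuple.
dd = Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}()
for i in 1:100
    dd[i] = (rand(), Normal())
end

function lratio2{FT}(dd::Dict{FT, Tuple{Float64, ContinuousUnivariateDistribution}})
    num::Float64 = 0.0
    denom::Float64 = 0.0
    for (k, v) in dd        # the iteration that is slow on v0.5
        denom += v[1] * 1.2
    end
    return num / denom
end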

---- Julia v0.5.0-rc3
julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

julia> @benchmark lratio2(dd)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  816.00 bytes
  allocs estimate:  31
  minimum time:     35.40 μs (0.00% GC)
  median time:      35.88 μs (0.00% GC)
  mean time:        36.08 μs (0.00% GC)
  maximum time:     59.26 μs (0.00% GC)
---- Julia v0.4.6
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> @benchmark lratio2(dd)

BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  816.00 bytes
  allocs estimate:  31
  minimum time:     2.73 μs (0.00% GC)
  median time:      2.77 μs (0.00% GC)
  mean time:        2.90 μs (2.74% GC)
  maximum time:     339.18 μs (98.62% GC)
@ViralBShah added the "performance" label Sep 2, 2016
@ViralBShah added this to the 0.5.x milestone Sep 2, 2016
@yuyichao (Contributor) commented Sep 2, 2016

I think this is a dup of #16190.

@ViralBShah (Member) commented:

It certainly will be nice to have a fix for this in 0.5.

@yuyichao (Contributor) commented Sep 2, 2016

Yes, it is a dup of #16190: changing the value type to an isbits type removes the allocation and the performance difference. This basically needs an even more aggressive version of #16021, which was blocked by a codegen bug. It should be possible to do now that #18163 is merged. It's unlikely to be backportable to 0.5, though.
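
(A minimal sketch of that check, with hypothetical types; not from the issue:

dd_isbits = Dict{Int64, Tuple{Float64, Float64}}()  # isbits value type: iterating allocates nothing
dd_boxed  = Dict{Int64, Tuple{Float64, Any}}()      # Any field: the tuple value is boxed on each iteration

On v0.5, iterating dd_isbits shows no allocation and no slowdown, while iterating dd_boxed allocates on every iteration.)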

I'm not sure how the allocation elimination worked on 0.4, so I'm not sure how this regressed. Likely related to linearizing the IR.

@yuyichao closed this as completed Sep 2, 2016
@ViralBShah (Member) commented:

Cc @vtjnash

@yuyichao (Contributor) commented Sep 2, 2016

Actually, profiling shows that the allocation accounts for the 500 ns -> 3 μs part of the difference (which didn't actually regress from 0.4); the big regression seems to come from the typemap. So although fixing #16190 will fix this case, there are more performance issues here (we might already have an issue open for this, but I'm not sure).

So indeed @vtjnash

@yuyichao reopened this Sep 2, 2016
@yuyichao (Contributor) commented Sep 2, 2016

The profile:

101102 /home/yuyichao/script.jl:6; lratio2(::Dict{Int64,Tuple{Float64,Any}})
 98648 /build/julia-git/src/julia-avx2/src/gf.c:1890; jl_apply_generic
  98475 /build/julia-git/src/julia-avx2/src/typemap.c:862; jl_typemap_level_assoc_exact
   20760 /build/julia-git/src/julia-avx2/src/typemap.c:798; jl_typemap_entry_assoc_exact
   2282  /build/julia-git/src/julia-avx2/src/typemap.c:799; jl_typemap_entry_assoc_exact
   68494 /build/julia-git/src/julia-avx2/src/typemap.c:817; jl_typemap_entry_assoc_exact
    6440  /build/julia-git/src/julia-avx2/src/typemap.c:120; sig_match_simple
    8133  /build/julia-git/src/julia-avx2/src/typemap.c:135; sig_match_simple
     8079 /build/julia-git/src/julia-avx2/src/julia.h:969; jl_is_type_type
    4129  /build/julia-git/src/julia-avx2/src/typemap.c:136; sig_match_simple
     4129 /build/julia-git/src/julia-avx2/src/julia.h:690; jl_svecref
    2040  /build/julia-git/src/julia-avx2/src/typemap.c:137; sig_match_simple
    7434  /build/julia-git/src/julia-avx2/src/typemap.c:142; sig_match_simple
    22112 /build/julia-git/src/julia-avx2/src/typemap.c:147; sig_match_simple
     4963 /build/julia-git/src/julia-avx2/src/jltypes.c:1664; type_eqv__.part.11
     4417 /build/julia-git/src/julia-avx2/src/jltypes.c:1668; type_eqv__.part.11
     2836 /build/julia-git/src/julia-avx2/src/jltypes.c:1685; type_eqv__.part.11
     2676 /build/julia-git/src/julia-avx2/src/jltypes.c:1706; type_eqv__.part.11
    4800  /build/julia-git/src/julia-avx2/src/typemap.c:169; sig_match_simple

@vtjnash (Member) commented Sep 2, 2016

I'm not sure why dispatching on this particular type is so much more expensive, although it is a complex type, so it has to go through type_eqv, and the sort order of the table is different now. The real issue is that methods is returning too many matches:

julia> Base._methods_by_ftype(Tuple{typeof(convert), Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}}, Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}}, -1)
2-element Array{Any,1}:
 svec(Tuple{Base.#convert,Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},_<:Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},svec(Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}),convert{T<:Tuple{Any,Vararg{Any,N}}}(::Type{T}, x::T) at essentials.jl:69)                         
 svec(Tuple{Base.#convert,Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},svec(Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}),convert{T<:Tuple{Any,Vararg{Any,N}}}(::Type{T}, x::Tuple{Any,Vararg{Any,N<:Any}}) at essentials.jl:67)

Also, looking at this method lookup result, it looks like the type system is confused and returns an inappropriate typevar from jl_lookup_match. So while env is correct, jl_typemap_assoc_by_type_ believes this is an inexact match:

(lldb) p jl_($15->sig)
Tuple{Base.#convert, Type{#T<:Tuple{Any, Vararg{Any, N<:Any}}}, #T<:Tuple{Any, Vararg{Any, N<:Any}}}
(lldb) p jl_svec_t *env = jl_emptysvec; jl_lookup_match(v, (jl_value_t*)$15->sig, &env, $15->tvars);
(jl_value_t *) $18 = 0x000000010c55e0b0
(lldb) p jl_($18)
Tuple{Base.#convert, Type{Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}}}, _<:Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}}}
(lldb) p jl_(env)
svec(Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}})

Another side issue is that jl_get_specialization1 is very pessimistic, but also wrong:

diff --git a/src/gf.c b/src/gf.c
index 0e69c74..21d2a7d 100644
--- a/src/gf.c
+++ b/src/gf.c
@@ -1332,9 +1332,9 @@ jl_lambda_info_t *jl_get_specialization1(jl_tupletype_t *types)
     for (i = 0; i < jl_nparams(types); i++) {
         jl_value_t *ti = jl_tparam(types, i);
         // if one argument type is DataType, multiple Type{} definitions
-        // might match. also be conservative with tuples rather than trying
-        // to analyze them in detail.
-        if (ti == (jl_value_t*)jl_datatype_type || jl_is_tuple_type(ti)) {
+        // might match.
+        // /* also be conservative with tuple types rather than trying to analyze them in detail. */
+        if (is_kind(ti)) {
             jl_value_t *matches = jl_matching_methods(types, 1, 0);
             if (matches == jl_false)
                 return NULL;

@JeffBezanson added the "regression" label Sep 2, 2016
@vtjnash (Member) commented Sep 4, 2016

the big regression seems to actually come from typemap so

The time is being spent in the typemap, but that's not the regression: this case wasn't optimized at all in either the old or the new code. But the convert method cache has about 10x the number of entries on v0.5. With that difference subtracted out (by resetting the cache for convert), the performance is identical. It looks like the real failure is that the way convert is defined for tuples (as a tail call) can be extremely expensive for the cache. On v0.4, the benchmark just got lucky that nobody had called that method to generate any specializations of it before the benchmark ran.
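
For context, the tuple convert definitions in question (signatures as quoted in the method match above; bodies paraphrased from base/essentials.jl of that era) recurse on the tuple tail, so every distinct tuple type encountered at runtime inserts a whole chain of new specializations into the convert cache:

# Paraphrased from base/essentials.jl:67-69 on v0.5:
convert{T<:Tuple{Any,Vararg{Any}}}(::Type{T}, x::T) = x
convert{T<:Tuple{Any,Vararg{Any}}}(::Type{T}, x::Tuple{Any,Vararg{Any}}) =
    tuple(convert(tuple_type_head(T), x[1]),
          convert(tuple_type_tail(T), tail(x))...)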

@ViralBShah (Member) commented:

Can we get the convert definition fix in for 0.5?

@vtjnash removed the "regression" label Sep 5, 2016
@vtjnash (Member) commented Sep 6, 2016

FWIW, it's easy to make a version that is 2x faster than the OP's (and has the same timing on v0.4 / v0.5):

dd = Dict{FT, Any}()

function lratio2(dd::Dict)
    num::Float64 = 0.0
    denom::Float64 = 0.0

    for (k, v) in dd
        pDiscrete = v[1]
        pContinuousGivenDiscrete = 1.2
        denom += pDiscrete * pContinuousGivenDiscrete
    end

    return num / denom
end

@vtjnash removed this from the 0.5.x milestone Sep 7, 2016
@vtjnash (Member) commented Sep 7, 2016

Removing the milestone, since it's easy to fix the user code but perhaps difficult to teach the compiler to do that same transform.

@tkelman (Contributor) commented Sep 7, 2016

Then we should document (now) what needs to be done and what kind of code is likely to hit this.

@kshyatt added the "needs docs" label Sep 7, 2016
@chipkent (Author) commented Sep 8, 2016

I'm trying to understand and test the user code fix. Are you suggesting that

function lratio2{FT}(dd::Dict{FT,Tuple{Float64, ContinuousUnivariateDistribution}})

be changed to

function lratio2(dd::Dict)

... or am I missing something? If so, this change is not a workaround and still experiences the same slowness.

@vtjnash (Member) commented Sep 8, 2016

No, I'm suggesting that parameterizing the Dict is the source of your performance issue. Making the function signature less restrictive was just a necessary side effect to allow that easily.
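
Concretely (with a hypothetical key type), the suggested change is:

# Before: concrete tuple value type; iteration hits the expensive tuple convert
dd = Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}()

# After: widen the value type to Any, and accept it with lratio2(dd::Dict)
dd = Dict{Int64, Any}()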

@JeffBezanson (Member) commented:

I don't think using Any as the value type is an acceptable workaround. It's a significant change to the code, and could easily slow down the program elsewhere.

@chipkent (Author) commented Sep 8, 2016

With the recommended changes, the "real" code is not showing any improvements. I'm afraid it may be slowing down elsewhere, as suggested. I'll look more into the details when I get a chance.

@chipkent (Author) commented Sep 8, 2016

I've identified what is going on.

  1. The "Any" fix did improve the performance.
  2. The "Any" fix could have slowed down the program elsewhere -- fortunately, it didn't.
  3. I had to make changes in multiple places in my code. The function was implemented for an abstract type, with the Dict as a field of the type. I had to change all of the subtypes of this abstract type for the performance problem to be resolved in all cases. In my initial code change, I had missed changes to some types.

Overall, this fix resolves the immediate problem, but it is not a good solution for the language, especially when the language is being sold as 1) high performance and 2) designed so that static analysis of the code is easy. I'm certain that similar iteration is a very common use case.
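
A sketch of the structure described in point 3 (hypothetical names, v0.5 syntax): the Dict is a field of each concrete subtype, so every subtype had to be updated.

abstract Model

type ModelA <: Model
    dd::Dict{Int64, Any}  # was Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}
end

type ModelB <: Model
    dd::Dict{Int64, Any}  # missing even one subtype leaves that code on the slow path
end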

@tkelman (Contributor) commented Apr 21, 2017

It looks like this was fixed on 0.6, probably by #21377. I get medians of

  • master: 275 ns
  • 0.5: 35 μs
  • 0.4: 3.1 μs

This would be worth adding to BaseBenchmarks.jl so it can be automatically tracked for regressions.
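
(A sketch of what such an entry might look like, assuming the usual BenchmarkTools BenchmarkGroup/@benchmarkable layout; the group name and setup are hypothetical:

using BenchmarkTools

const SUITE = BenchmarkGroup()
dd = Dict{Int64, Tuple{Float64, Any}}()
for i in 1:100
    dd[i] = (rand(), rand())
end
SUITE["dict iteration, boxed tuple values"] = @benchmarkable lratio2($dd)
)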

@tkelman closed this as completed Apr 21, 2017
@tkelman added the "potential benchmark" label Apr 22, 2017