
Julia v0.5.0 dictionary iteration is >10x slower than v0.4.6 #18331

Closed
chipkent opened this issue Sep 2, 2016 · 18 comments
Labels: needs docs, performance, potential benchmark

@chipkent commented Sep 2, 2016

After porting my code to Julia v0.5.0, I am seeing a 3x reduction in speed in a critical area. I have narrowed the root cause down to what appears to be slow dictionary iteration in v0.5.0.

Reproducer code:
https://gist.github.com/chipkent/c8e3acc4f64bd78db4fe40f2c42072f2
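
For reference, the hot loop looks roughly like this (paraphrased; the signature is quoted later in this thread, and the key type and setup here are purely illustrative, v0.5 syntax):

using Distributions

# The value type mixes an isbits Float64 with an abstract Distribution type,
# so each value is a non-isbits tuple.
dd = Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}()
for i in 1:100
    dd[i] = (rand(), Normal())
end

function lratio2{FT}(dd::Dict{FT, Tuple{Float64, ContinuousUnivariateDistribution}})
    num::Float64 = 0.0
    denom::Float64 = 0.0
    for (k, v) in dd        # the iteration that is slow on v0.5
        denom += v[1] * 1.2
    end
    return num / denom
end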

---- Julia v0.5.0-rc3
julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

julia> @benchmark lratio2(dd)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  816.00 bytes
  allocs estimate:  31
  minimum time:     35.40 μs (0.00% GC)
  median time:      35.88 μs (0.00% GC)
  mean time:        36.08 μs (0.00% GC)
  maximum time:     59.26 μs (0.00% GC)
---- Julia v0.4.6
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> @benchmark lratio2(dd)

BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  816.00 bytes
  allocs estimate:  31
  minimum time:     2.73 μs (0.00% GC)
  median time:      2.77 μs (0.00% GC)
  mean time:        2.90 μs (2.74% GC)
  maximum time:     339.18 μs (98.62% GC)
@ViralBShah added the "performance" label Sep 2, 2016
@ViralBShah added this to the 0.5.x milestone Sep 2, 2016
@yuyichao (Contributor) commented Sep 2, 2016

I think this is a dup of #16190.

@ViralBShah (Member) commented:

It certainly will be nice to have a fix for this in 0.5.

@yuyichao (Contributor) commented Sep 2, 2016

Yes, it is a dup of #16190: changing the value type to an isbits type removes the allocation and the performance difference. This basically needs an even more aggressive version of #16021, which was blocked by a codegen bug. It should be possible to do now that #18163 is merged. It's unlikely to be backportable to 0.5, though.
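
(A minimal sketch of that check, with hypothetical types; not from the issue:

dd_isbits = Dict{Int64, Tuple{Float64, Float64}}()  # isbits value type: iterating allocates nothing
dd_boxed  = Dict{Int64, Tuple{Float64, Any}}()      # Any field: the tuple value is boxed on each iteration

On v0.5, iterating dd_isbits shows no allocation and no slowdown, while iterating dd_boxed allocates on every iteration.)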

I'm not sure how the allocation elimination worked on 0.4, so I'm not sure how this regressed. Likely related to linearizing the IR.

@yuyichao closed this as completed Sep 2, 2016
@ViralBShah (Member) commented:

Cc @vtjnash

@yuyichao (Contributor) commented Sep 2, 2016

Actually, profiling shows that the allocation accounts for the 500 ns -> 3 μs part of the difference (which didn't actually regress from 0.4); the big regression seems to come from the typemap. So although fixing #16190 will fix this case, there are more performance issues here (we might already have an issue open for this, but I'm not sure).

So indeed @vtjnash

@yuyichao reopened this Sep 2, 2016
@yuyichao (Contributor) commented Sep 2, 2016

The profile:

101102 /home/yuyichao/script.jl:6; lratio2(::Dict{Int64,Tuple{Float64,Any}})
 98648 /build/julia-git/src/julia-avx2/src/gf.c:1890; jl_apply_generic
  98475 /build/julia-git/src/julia-avx2/src/typemap.c:862; jl_typemap_level_assoc_exact
   20760 /build/julia-git/src/julia-avx2/src/typemap.c:798; jl_typemap_entry_assoc_exact
   2282  /build/julia-git/src/julia-avx2/src/typemap.c:799; jl_typemap_entry_assoc_exact
   68494 /build/julia-git/src/julia-avx2/src/typemap.c:817; jl_typemap_entry_assoc_exact
    6440  /build/julia-git/src/julia-avx2/src/typemap.c:120; sig_match_simple
    8133  /build/julia-git/src/julia-avx2/src/typemap.c:135; sig_match_simple
     8079 /build/julia-git/src/julia-avx2/src/julia.h:969; jl_is_type_type
    4129  /build/julia-git/src/julia-avx2/src/typemap.c:136; sig_match_simple
     4129 /build/julia-git/src/julia-avx2/src/julia.h:690; jl_svecref
    2040  /build/julia-git/src/julia-avx2/src/typemap.c:137; sig_match_simple
    7434  /build/julia-git/src/julia-avx2/src/typemap.c:142; sig_match_simple
    22112 /build/julia-git/src/julia-avx2/src/typemap.c:147; sig_match_simple
     4963 /build/julia-git/src/julia-avx2/src/jltypes.c:1664; type_eqv__.part.11
     4417 /build/julia-git/src/julia-avx2/src/jltypes.c:1668; type_eqv__.part.11
     2836 /build/julia-git/src/julia-avx2/src/jltypes.c:1685; type_eqv__.part.11
     2676 /build/julia-git/src/julia-avx2/src/jltypes.c:1706; type_eqv__.part.11
    4800  /build/julia-git/src/julia-avx2/src/typemap.c:169; sig_match_simple

@vtjnash (Member) commented Sep 2, 2016

I'm not sure why dispatching on this particular type is so much more expensive, although it is a complex type, so it has to go through type_eqv, and the sort order of the table is different now. The real issue is that methods is returning too many matches:

julia> Base._methods_by_ftype(Tuple{typeof(convert), Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}}, Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}}, -1)
2-element Array{Any,1}:
 svec(Tuple{Base.#convert,Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},_<:Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},svec(Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}),convert{T<:Tuple{Any,Vararg{Any,N}}}(::Type{T}, x::T) at essentials.jl:69)                         
 svec(Tuple{Base.#convert,Type{Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}},svec(Tuple{Float64,Distributions.Distribution{Distributions.Univariate,Distributions.Continuous}}),convert{T<:Tuple{Any,Vararg{Any,N}}}(::Type{T}, x::Tuple{Any,Vararg{Any,N<:Any}}) at essentials.jl:67)

Also, looking at this method lookup result, it looks like the type system is confused and returns an inappropriate typevar from jl_lookup_match. So while env is correct, jl_typemap_assoc_by_type_ believes this is an inexact match:

(lldb) p jl_($15->sig)
Tuple{Base.#convert, Type{#T<:Tuple{Any, Vararg{Any, N<:Any}}}, #T<:Tuple{Any, Vararg{Any, N<:Any}}}
(lldb) p jl_svec_t *env = jl_emptysvec; jl_lookup_match(v, (jl_value_t*)$15->sig, &env, $15->tvars);
(jl_value_t *) $18 = 0x000000010c55e0b0
(lldb) p jl_($18)
Tuple{Base.#convert, Type{Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}}}, _<:Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}}}
(lldb) p jl_(env)
svec(Tuple{Float64, Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}})

Another side issue is that jl_get_specialization1 is very pessimistic, but also wrong:

diff --git a/src/gf.c b/src/gf.c
index 0e69c74..21d2a7d 100644
--- a/src/gf.c
+++ b/src/gf.c
@@ -1332,9 +1332,9 @@ jl_lambda_info_t *jl_get_specialization1(jl_tupletype_t *types)
     for (i = 0; i < jl_nparams(types); i++) {
         jl_value_t *ti = jl_tparam(types, i);
         // if one argument type is DataType, multiple Type{} definitions
-        // might match. also be conservative with tuples rather than trying
-        // to analyze them in detail.
-        if (ti == (jl_value_t*)jl_datatype_type || jl_is_tuple_type(ti)) {
+        // might match.
+        // /* also be conservative with tuple types rather than trying to analyze them in detail. */
+        if (is_kind(ti)) {
             jl_value_t *matches = jl_matching_methods(types, 1, 0);
             if (matches == jl_false)
                 return NULL;

@JeffBezanson added the "regression" label Sep 2, 2016
@vtjnash (Member) commented Sep 4, 2016

the big regression seems to actually come from typemap so

The time is being spent in the typemap, but that's not the regression: this case wasn't optimized at all in either the old or the new code. But the convert method cache has about 10x the number of entries on v0.5. With that difference subtracted out (by resetting the cache for convert), the performance is identical. It looks like the real failure is that the way convert is defined for tuples (as a tail call) can be extremely expensive for the cache. On v0.4, the benchmark just got lucky that nobody had called that method to generate any specializations of it before the benchmark ran.
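
For context, the tuple convert definitions in question (signatures as quoted in the method match above; bodies paraphrased from base/essentials.jl of that era) recurse on the tuple tail, so every distinct tuple type encountered at runtime inserts a whole chain of new specializations into the convert cache:

# Paraphrased from base/essentials.jl:67-69 on v0.5:
convert{T<:Tuple{Any,Vararg{Any}}}(::Type{T}, x::T) = x
convert{T<:Tuple{Any,Vararg{Any}}}(::Type{T}, x::Tuple{Any,Vararg{Any}}) =
    tuple(convert(tuple_type_head(T), x[1]),
          convert(tuple_type_tail(T), tail(x))...)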

@ViralBShah (Member) commented:

Can we get the convert definition fix in for 0.5?

@vtjnash removed the "regression" label Sep 5, 2016
@vtjnash (Member) commented Sep 6, 2016

FWIW, it's easy to make a version that is 2x faster than the OP's (and has the same timing on v0.4 / v0.5):

dd = Dict{FT, Any}()

function lratio2(dd::Dict)
    num::Float64 = 0.0
    denom::Float64 = 0.0

    for (k, v) in dd
        pDiscrete = v[1]
        pContinuousGivenDiscrete = 1.2
        denom += pDiscrete * pContinuousGivenDiscrete
    end

    return num / denom
end

@vtjnash removed this from the 0.5.x milestone Sep 7, 2016
@vtjnash (Member) commented Sep 7, 2016

Removing the milestone, since it's easy to fix the user code but perhaps difficult to teach the compiler to do that same transform.

@tkelman (Contributor) commented Sep 7, 2016

Then we should document (now) what needs to be done and what kind of code is likely to hit this.

@kshyatt added the "needs docs" label Sep 7, 2016
@chipkent (Author) commented Sep 8, 2016

I'm trying to understand and test the user code fix. Are you suggesting that

function lratio2{FT}(dd::Dict{FT,Tuple{Float64, ContinuousUnivariateDistribution}})

be changed to

function lratio2(dd::Dict)

... or am I missing something? If so, this change is not a workaround and still experiences the same slowness.

@vtjnash (Member) commented Sep 8, 2016

No, I'm suggesting that parameterizing the Dict is the source of your performance issue. Making the function signature less restrictive was just a necessary side effect to allow that easily.
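
Concretely (with a hypothetical key type), the suggested change is:

# Before: concrete tuple value type; iteration hits the expensive tuple convert
dd = Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}()

# After: widen the value type to Any, and accept it with lratio2(dd::Dict)
dd = Dict{Int64, Any}()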

@JeffBezanson (Member) commented:

I don't think using Any as the value type is an acceptable workaround. It's a significant change to the code, and could easily slow down the program elsewhere.

@chipkent (Author) commented Sep 8, 2016

With the recommended changes, the "real" code is not showing any improvements. I'm afraid it may be slowing down elsewhere, as suggested. I'll look more into the details when I get a chance.

@chipkent (Author) commented Sep 8, 2016

I've identified what is going on.

  1. The "Any" fix did improve the performance.
  2. The "Any" fix could have slowed down the program elsewhere -- fortunately, it didn't.
  3. I had to make changes in multiple places in my code. The function was implemented for an abstract type, with the Dict as a field of the type. I had to change all of the subtypes of this abstract type for the performance problem to be resolved in all cases. In my initial code change, I had missed changes to some types.

Overall, this fix resolves the immediate problem, but it is not a good solution for the language, especially when the language is being sold as 1) high performance and 2) designed so that static analysis of the code is easy. I'm certain that similar iteration is a very common use case.
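
A sketch of the structure described in point 3 (hypothetical names, v0.5 syntax): the Dict is a field of each concrete subtype, so every subtype had to be updated.

abstract Model

type ModelA <: Model
    dd::Dict{Int64, Any}  # was Dict{Int64, Tuple{Float64, ContinuousUnivariateDistribution}}
end

type ModelB <: Model
    dd::Dict{Int64, Any}  # missing even one subtype leaves that code on the slow path
end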

@tkelman (Contributor) commented Apr 21, 2017

It looks like this was fixed on 0.6, probably by #21377. I get medians of

  • master: 275 ns
  • 0.5: 35 μs
  • 0.4: 3.1 μs

This would be worth adding to BaseBenchmarks.jl so it can be automatically tracked for regressions.
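
(A sketch of what such an entry might look like, assuming the usual BenchmarkTools BenchmarkGroup/@benchmarkable layout; the group name and setup are hypothetical:

using BenchmarkTools

const SUITE = BenchmarkGroup()
dd = Dict{Int64, Tuple{Float64, Any}}()
for i in 1:100
    dd[i] = (rand(), rand())
end
SUITE["dict iteration, boxed tuple values"] = @benchmarkable lratio2($dd)
)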

@tkelman closed this as completed Apr 21, 2017
@tkelman added the "potential benchmark" label Apr 22, 2017