Improve count performance #40564
Conversation
This is a speculative PR, as I do not fully understand why the original code was not properly optimized by the compiler (so maybe a better fix is to change the compiler). Here is an example of the performance difference:

```julia
julia> x = rand([1, missing], 10^6);

julia> @btime count(ismissing, $x)
  501.200 μs (0 allocations: 0 bytes)
500396

julia> @btime count(Base.Generator(ismissing, $x))
  70.800 μs (0 allocations: 0 bytes)
500396
```

(This is run on 1.6, but this PR is essentially doing the same by introducing a function barrier.)
Some notes: looking at `@code_llvm Base._simple_count(ismissing, x, 0)` and `@code_llvm Base._simple_count(identity, Base.Generator(ismissing, x), 0)`, the second case manages to vectorize. The LLVM vectorization pass reports different messages in the two cases (only the second reports successful vectorization).
```diff
-    for x in itr
-        n += pred(x)::Bool
+    for x in g
+        n += x::Bool
```
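For context, the change above is a function-barrier pattern. A minimal, self-contained sketch of the idea (the names `my_count`/`_my_count` are hypothetical, not the actual Base code):

```julia
function my_count(pred, itr)
    # Wrap the predicate application in a Generator so the inner loop
    # iterates over a stream of Bools instead of calling `pred` itself.
    g = Base.Generator(pred, itr)
    return _my_count(g)  # function barrier: inner loop compiled separately
end

function _my_count(g)
    n = 0
    for x in g
        n += x::Bool  # the ::Bool assertion helps the compiler vectorize
    end
    return n
end
```

Under these assumptions, `my_count(ismissing, x)` should compute the same result as `count(ismissing, x)`, but with the inner loop isolated behind the barrier.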
Is this just `n += true` now?
No, `pred` can return `true` or `false`, e.g.:

```julia
julia> collect(Base.Generator(ismissing, [1, missing, 2]))
3-element Vector{Bool}:
 0
 1
 0
```

and we still need the `::Bool` assertion.
Oh yeah, somehow I thought the first argument was a filter with this syntax.
That is why I was surprised that the compiler does not vectorize it (the only thing I do is introduce the `Generator` wrapper and add a function barrier, but both are needed to force vectorization).
This also fixes it. @Keno, I think it is about time we switched this to assuming subtyping and intersection are perfect, WDYT?

```diff
diff --git a/base/compiler/ssair/inlining.jl b/base/compiler/ssair/inlining.jl
index 0420d137da..328f1f0f3a 100644
--- a/base/compiler/ssair/inlining.jl
+++ b/base/compiler/ssair/inlining.jl
@@ -426,7 +426,7 @@ function ir_inline_item!(compact::IncrementalCompact, idx::Int, argexprs::Vector
     return_value
 end

-const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")
+#const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")

 function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
                                argexprs::Vector{Any}, linetable::Vector{LineInfoNode},
@@ -443,18 +443,20 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
         @assert !isa(metharg, UnionAll)
         cond = true
         @assert length(atype.parameters) == length(metharg.parameters)
-        for i in 1:length(atype.parameters)
-            a, m = atype.parameters[i], metharg.parameters[i]
-            # If this is always true, we don't need to check for it
-            a <: m && continue
-            # Generate isa check
-            isa_expr = Expr(:call, isa, argexprs[i], m)
-            ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
-            if cond === true
-                cond = ssa
-            else
-                and_expr = Expr(:call, and_int, cond, ssa)
-                cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+        if !item.fully_covered || next_cond_bb != generic_bb
+            for i in 1:length(atype.parameters)
+                a, m = atype.parameters[i], metharg.parameters[i]
+                # If this is always true, we don't need to check for it
+                a <: m && continue
+                # Generate isa check
+                isa_expr = Expr(:call, isa, argexprs[i], m)
+                ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
+                if cond === true
+                    cond = ssa
+                else
+                    and_expr = Expr(:call, and_int, cond, ssa)
+                    cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+                end
             end
         end
         insert_node_here!(compact, NewInstruction(GotoIfNot(cond, next_cond_bb), Union{}, line))
@@ -495,8 +497,8 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
         bb += 1
         # We're now in the fall through block, decide what to do
         if item.fully_covered
-            e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
-            insert_node_here!(compact, NewInstruction(e, Union{}, line))
+            #e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
+            #insert_node_here!(compact, NewInstruction(e, Union{}, line))
             insert_node_here!(compact, NewInstruction(ReturnNode(), Union{}, line))
             finish_current_bb!(compact, 0)
         else
```
Bump.

Would be nice to get this in.

Is this just ready to merge?
@bkamins Does this help with any of the recent DataFrames performance work and benchmarks? Trying to figure out if it is a nice-to-eventually-have, or if it is affecting performance-sensitive things and should be addressed sooner.
This does not affect DataFrames.jl directly. However, it affects any data science workflow (and thus DataFrames.jl indirectly). What is holding up writing JuliaLang/www.julialang.org#1282 is indeed performance issues, but they are related to #40840. However, I fear that #40840 will not be resolved quickly (if at all; and even if it is resolved, the fact that GC in Julia is not multi-threading friendly will keep hitting us constantly in data science workflows). Therefore @quinnj is now rewriting CSV.jl to avoid GC issues as much as possible. After the new CSV.jl is out, I expect that in https://h2oai.github.io/db-benchmark/ the only thing we are currently bad at (i.e. joins on large data sets) will be much better. Essentially, in the 5GB join tests we should hopefully go down from 402 seconds to ~100 seconds (without changing anything in DataFrames.jl). Then we will be ready to make announcements. Of course there will still be room for improvements, but they will be marginal: in general, data.table and Polars are already very well tuned in terms of performance, so we should not expect to be able to be much better, but we essentially already have comparable performance; here https://opendatascience.com/saying-hello-to-dataframes-jl/ is a small comparison that is not GC-bound, as I am using only integers there.
Thanks @bkamins, very useful.

@bkamins I don't understand what CSV.jl has to do with the performance of DataFrames.jl in the joining benchmarks.
I will talk about it during my presentation at JuliaCon 2021 😄 (which is now accepted). The chain of consequences is:

The consequence is that in the H2O benchmarks of joins, garbage collection is around 75% of the total run time of joining (before some changes I made recently it was around 90%). Now, how to resolve it:

And the CSV.jl changes are exactly addressing the second point: @quinnj is developing changes that will provide two additional options of storing strings except for

In short the answer is: Is this clearer now?
@vtjnash, is that other fix (which I did not review) merged? Either way, can the PR here be merged? It seems simple and gives a good speedup, while being less general. Given the "forgetmenot" label, I would like to see either or both in Julia 1.7.
We are way past the point of getting this into 1.7. That said, we absolutely should try to get this merged for 1.8 if it's a good improvement.
One general solution is to force specialization:

```julia
function _simple_count(pred::F, itr, init::T) where {F,T}
```

You will likely need to do this on

To be clear, changes like these give you better runtime performance at the expense of worse latency. Often that's worth it, but just FYI.
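As a sketch of why the `pred::F where F` annotation matters (a minimal, illustrative example with hypothetical names, not the actual Base code):

```julia
# Julia may choose not to specialize a method on a function-typed argument
# that is only passed through, so `pred` can be called via a generic,
# non-inlined path inside the loop:
function count_unspecialized(pred, itr)
    n = 0
    for x in itr
        n += pred(x)::Bool
    end
    return n
end

# Giving the argument its own type parameter forces a separate method
# specialization per function passed in, letting the compiler inline
# `pred` into the loop body (and potentially vectorize it):
function count_specialized(pred::F, itr) where {F}
    n = 0
    for x in itr
        n += pred(x)::Bool
    end
    return n
end
```

Both return the same result; the difference is only in how the compiler specializes the method for each `pred`.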
Adding the triage label to this so we can get discussion as to whether we like this approach. IMO we should merge it.

Let's merge this for now, and open a new issue for the compiler fix.