
Improve count performance #40564

Merged
merged 2 commits into from
Nov 18, 2021
Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 22, 2021

This is a speculative PR as I do not fully understand why the original code was not properly optimized by the compiler (so maybe a better fix is to change compiler).

Here is an example of the performance difference:
```
julia> x = rand([1, missing], 10^6);

julia> @btime count(ismissing, $x)
  501.200 μs (0 allocations: 0 bytes)
500396

julia> @btime count(Base.Generator(ismissing, $x))
  70.800 μs (0 allocations: 0 bytes)
500396
```
(this is run on 1.6, but this PR is essentially doing the same by introducing a function barrier)
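The function barrier can be sketched roughly as follows (the names here are illustrative, not the exact code merged into Base):

```julia
# Sketch of the function-barrier idea: wrap the predicate and the iterator
# in a Generator, then count inside a separate inner function. The inner
# function sees a concrete Generator type, which helps the compiler produce
# a tight, vectorizable loop.
count_with_barrier(pred, itr) = _count_generator(Base.Generator(pred, itr))

function _count_generator(g)
    n = 0
    for x in g
        n += x::Bool  # each generator element is the predicate's result
    end
    return n
end

x = [1, missing, 2, missing]
@assert count_with_barrier(ismissing, x) == 2
```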
@nalimilan nalimilan added fold sum, maximum, reduce, foldl, etc. missing data Base.missing and related functionality performance Must go faster labels Apr 22, 2021
@KristofferC
Copy link
Member

KristofferC commented Apr 22, 2021

Some notes, looking at:

```julia
@code_llvm Base._simple_count(ismissing, x, 0)
```

and

```julia
@code_llvm Base._simple_count(identity, Base.Generator(ismissing, x), 0)
```

the second case manages to vectorize. The LLVM vectorization pass reports:

```
remark: reduce.jl:969:0: loop not vectorized: loop control flow is not understood by vectorizer
remark: reduce.jl:969:0: loop not vectorized: loop control flow is not understood by analyzer
remark: reduce.jl:969:0: loop not vectorized
```

in the first case but

```
remark: array.jl:777:0: vectorized loop (vectorization width: 4, interleaved count: 4)
```

in the second case.

```julia
for x in itr
    n += pred(x)::Bool
```

```julia
for x in g
    n += x::Bool
```
Copy link
Member


Is this just n += true now?

Copy link
Member Author


No, pred can return true or false, e.g.:

```
julia> collect(Base.Generator(ismissing, [1,missing,2]))
3-element Vector{Bool}:
 0
 1
 0
```

and we still need the ::Bool assertion.
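To illustrate why the assertion is kept (a hypothetical example, not code from this PR): a predicate that returns a non-Bool value should fail the `n += pred(x)::Bool` step rather than silently contribute an arbitrary number to the count.

```julia
# A predicate that (incorrectly) returns an Int instead of a Bool.
badpred(x) = 1

# The ::Bool assertion turns this into a TypeError instead of a wrong count.
err = try
    badpred(10)::Bool
catch e
    e
end
@assert err isa TypeError
```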

Copy link
Member


Oh yeah, somehow I thought the first argument was a filter with this syntax.

Copy link
Member Author


That is why I was surprised that the compiler does not vectorize it; the only thing I do is introduce a Generator wrapper and add a function barrier, but both are needed to force vectorization.

@vtjnash
Copy link
Member

vtjnash commented Apr 23, 2021

This also fixes it. @Keno, I think it is about time we switched this to assuming subtyping and intersection are perfect, WDYT?

```diff
diff --git a/base/compiler/ssair/inlining.jl b/base/compiler/ssair/inlining.jl
index 0420d137da..328f1f0f3a 100644
--- a/base/compiler/ssair/inlining.jl
+++ b/base/compiler/ssair/inlining.jl
@@ -426,7 +426,7 @@ function ir_inline_item!(compact::IncrementalCompact, idx::Int, argexprs::Vector
     return_value
 end
 
-const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")
+#const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")
 
 function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
                                argexprs::Vector{Any}, linetable::Vector{LineInfoNode},
@@ -443,18 +443,20 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
         @assert !isa(metharg, UnionAll)
         cond = true
         @assert length(atype.parameters) == length(metharg.parameters)
-        for i in 1:length(atype.parameters)
-            a, m = atype.parameters[i], metharg.parameters[i]
-            # If this is always true, we don't need to check for it
-            a <: m && continue
-            # Generate isa check
-            isa_expr = Expr(:call, isa, argexprs[i], m)
-            ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
-            if cond === true
-                cond = ssa
-            else
-                and_expr = Expr(:call, and_int, cond, ssa)
-                cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+        if !item.fully_covered || next_cond_bb != generic_bb
+            for i in 1:length(atype.parameters)
+                a, m = atype.parameters[i], metharg.parameters[i]
+                # If this is always true, we don't need to check for it
+                a <: m && continue
+                # Generate isa check
+                isa_expr = Expr(:call, isa, argexprs[i], m)
+                ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
+                if cond === true
+                    cond = ssa
+                else
+                    and_expr = Expr(:call, and_int, cond, ssa)
+                    cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+                end
             end
         end
         insert_node_here!(compact, NewInstruction(GotoIfNot(cond, next_cond_bb), Union{}, line))
@@ -495,8 +497,8 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
     bb += 1
     # We're now in the fall through block, decide what to do
     if item.fully_covered
-        e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
-        insert_node_here!(compact, NewInstruction(e, Union{}, line))
+        #e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
+        #insert_node_here!(compact, NewInstruction(e, Union{}, line))
         insert_node_here!(compact, NewInstruction(ReturnNode(), Union{}, line))
         finish_current_bb!(compact, 0)
     else
```

@ViralBShah
Copy link
Member

Bump.

@ViralBShah
Copy link
Member

Would be nice to get this in.

@oscardssmith
Copy link
Member

Is this just ready to merge?

@bkamins
Copy link
Member Author

bkamins commented May 20, 2021

I assumed this should not be merged; rather, @vtjnash and @Keno would make a more general fix per the comment above, and then this PR would just be closed as not needed.

@ViralBShah
Copy link
Member

@bkamins Does this help with any of the recent DataFrames performance work and benchmarks? Trying to figure out if it is a nice to eventually have, or is it affecting performance sensitive things and should be addressed sooner.

@bkamins
Copy link
Member Author

bkamins commented May 20, 2021

This does not affect DataFrames.jl directly. However, it affects any data science workflow (and thus DataFrames.jl indirectly), as missing values are super common in these contexts. Another basic thing that has a similar problem is #40768 (but it is also non-critical).

What is holding up writing JuliaLang/www.julialang.org#1282 is indeed performance issues, but they are related to #40840. However, I fear that #40840 will not be resolved quickly (if at all; and even if it is resolved, the fact that GC in Julia is not multi-threading friendly will keep hitting us constantly in data science workflows).

Therefore @quinnj is now rewriting CSV.jl to avoid GC issues as much as possible. After the new CSV.jl is out, I expect that in https://h2oai.github.io/db-benchmark/ the only thing we are currently bad at (i.e. joins on large data sets) will be much better. Essentially, in the 5GB join tests we should hopefully go down from 402 seconds to ~100 seconds (without changing anything in DataFrames.jl).

Then we will be ready to make announcements. Of course there will still be room for improvements, but they will be marginal; in general data.table and Polars are already very well tuned in terms of performance, so we should not expect to be much better, but we essentially already have comparable performance. Here https://opendatascience.com/saying-hello-to-dataframes-jl/ is a small comparison that is not GC bound, as I am using only integers there.

@ViralBShah
Copy link
Member

Thanks @bkamins - Very useful.

@skanskan
Copy link

@bkamins I don't understand what CSV.jl has to do with the performance of DataFrames.jl in the joining benchmarks.
Isn't CSV.jl just used to read and write CSV files?

@bkamins
Copy link
Member Author

bkamins commented May 25, 2021

I will talk about it in my presentation at JuliaCon 2021 😄 (which is now accepted).

The chain of consequences is:

  • CSV.jl currently reads in strings as the String type
  • the String type is not interned in Julia; all other fast packages like data.table or Polars do not have this problem (they use different approaches to read in the data)
  • because it is not interned, it takes part in garbage collection
  • because of this, every time GC is invoked it has to scan through all String objects in memory doing marking (this is super slow, as there can be billions of such small objects)
  • DataFrames.jl is currently multi-threaded, but garbage collection in Julia is single threaded, which means that when GC is invoked everything is frozen until garbage collection is finished

The consequence is that in the H2O join benchmarks garbage collection is around 75% of the total run time of joining (before some changes that I made recently it was around 90%).

Now, how to resolve it:

  • improve GC in Julia Base (this is something that Julia core devs are aware of, and I hope at some point it will be done, but I do not expect this to happen fast);
  • do not use the String type to store strings.

And the CSV.jl changes are exactly addressing the second point: @quinnj is developing changes that will provide two additional options for storing strings apart from the String type:

  • storing short strings in-line (this is kind of like string interning in R, but better, as it fully avoids allocations; the limitation is that only short strings can be handled this way, but short strings are the most common strings in practice anyway, so this is super useful to have)
  • having one big allocated object and having individual strings be just pointers to its fragments (something like the SubString type; this is the approach that e.g. Polars takes, as it is roughly the way Apache Arrow approaches storing strings)

In short, the answer is: the String type is slowing us down because of various interlinked design decisions in Julia Base, so CSV.jl will avoid creating String objects to work around this issue.

Is this clearer now?
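As a rough illustration of the "one big buffer plus views" option described above (a sketch only; CSV.jl's actual implementation differs), Julia's split already returns SubStrings that are views into a single shared buffer rather than freshly allocated Strings:

```julia
# One big allocated buffer; individual "strings" are just views into it.
buf = "alpha,beta,gamma"
parts = split(buf, ',')

@assert parts == ["alpha", "beta", "gamma"]
# Each part is a SubString sharing buf's memory, not a new String allocation.
@assert all(p -> p isa SubString{String}, parts)
```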

@vtjnash vtjnash added the forget me not PRs that one wants to make sure aren't forgotten label May 29, 2021
@vtjnash vtjnash added compiler:optimizer Optimization passes (mostly in base/compiler/ssair/) and removed fold sum, maximum, reduce, foldl, etc. labels Jul 19, 2021
@PallHaraldsson
Copy link
Contributor

> This also fixes it.

@vtjnash, is that other fix (which I did not review) merged? Either way, can the PR here be merged? It seems simple and gives a good speedup, even if it is less general. Given the "forget me not" label, I would like to see either or both in Julia 1.7.

@oscardssmith
Copy link
Member

We are way past the point of getting this into 1.7. That said, we should absolutely try to get this merged for 1.8 if it's a good improvement.

@bkamins
Copy link
Member Author

bkamins commented Jul 20, 2021

If @Keno and @vtjnash can provide a general solution (I understood from the discussion above that this is the case), then I think this PR is not needed. Otherwise, I think count is a very fundamental function, so it is better to have a custom half-measure solution for it than not to have one.

@timholy
Copy link
Member

timholy commented Jul 20, 2021

One general solution is to force specialization:

```julia
function _simple_count(pred::F, itr, init::T) where {F,T}
```

You will likely need to do this on count too, though it is worth checking (these things change over time).

To be clear, changes like these give you better runtime performance at the expense of worse latency. Often that's worth it, but just FYI.
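The suggestion above can be sketched as a standalone function (a minimal version assuming the same loop shape as in the PR; the name is illustrative):

```julia
# Forcing specialization on the predicate via ::F where {F}.
# Without it, Julia may compile one generic method shared by all predicates;
# with it, each predicate gets its own specialized, optimizable method.
function specialized_count(pred::F, itr, init::T) where {F,T}
    n = init
    for x in itr
        n += pred(x)::Bool
    end
    return n
end

@assert specialized_count(iseven, 1:10, 0) == 5
```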

@oscardssmith
Copy link
Member

Adding the triage label so we can get discussion as to whether we like this. IMO we should merge it.

@oscardssmith oscardssmith added the triage This should be discussed on a triage call label Nov 18, 2021
@JeffBezanson
Copy link
Member

Let's merge this for now, and open a new issue for the compiler fix.

@JeffBezanson JeffBezanson merged commit ebcce3f into JuliaLang:master Nov 18, 2021
@JeffBezanson JeffBezanson removed triage This should be discussed on a triage call forget me not PRs that one wants to make sure aren't forgotten labels Nov 18, 2021
@bkamins bkamins deleted the patch-27 branch November 18, 2021 20:56
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022