Improve count performance #40564
Conversation
This is a speculative PR, as I do not fully understand why the original code was not properly optimized by the compiler (so maybe a better fix is to change the compiler). Here is an example of the performance difference:

```julia
julia> x = rand([1, missing], 10^6);

julia> @btime count(ismissing, $x)
  501.200 μs (0 allocations: 0 bytes)
500396

julia> @btime count(Base.Generator(ismissing, $x))
  70.800 μs (0 allocations: 0 bytes)
500396
```

(This is run on 1.6, but this PR is essentially doing the same by introducing a function barrier.)
Some notes: looking at `@code_llvm Base._simple_count(ismissing, x, 0)` and `@code_llvm Base._simple_count(identity, Base.Generator(ismissing, x), 0)`, the second case manages to vectorize. The LLVM vectorization pass reports different messages in the two cases (only the second reports successful vectorization).
```diff
-    for x in itr
-        n += pred(x)::Bool
+    for x in g
+        n += x::Bool
```
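For context, the change above is a function-barrier pattern. A minimal, self-contained sketch of the idea (the names `my_count`/`_my_count` are hypothetical, not the actual Base code):

```julia
function my_count(pred, itr)
    # Wrap the predicate application in a Generator so the inner loop
    # iterates over a stream of Bools instead of calling `pred` itself.
    g = Base.Generator(pred, itr)
    return _my_count(g)  # function barrier: inner loop compiled separately
end

function _my_count(g)
    n = 0
    for x in g
        n += x::Bool  # the ::Bool assertion helps the compiler vectorize
    end
    return n
end
```

Under these assumptions, `my_count(ismissing, x)` should compute the same result as `count(ismissing, x)`, but with the inner loop isolated behind the barrier.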
Is this just `n += true` now?
No, `pred` can return `true` or `false`, e.g.:

```julia
julia> collect(Base.Generator(ismissing, [1, missing, 2]))
3-element Vector{Bool}:
 0
 1
 0
```

and we still need the `::Bool` assertion.
Oh yeah, somehow I thought the first argument was a filter with this syntax.
That is why I was surprised that the compiler does not vectorize it (the only thing I do is introduce the `Generator` wrapper and add a function barrier, but both are needed to force vectorization).
This also fixes it. @Keno, I think it is about time we switched this to assuming subtyping and intersection are perfect, WDYT?

```diff
diff --git a/base/compiler/ssair/inlining.jl b/base/compiler/ssair/inlining.jl
index 0420d137da..328f1f0f3a 100644
--- a/base/compiler/ssair/inlining.jl
+++ b/base/compiler/ssair/inlining.jl
@@ -426,7 +426,7 @@ function ir_inline_item!(compact::IncrementalCompact, idx::Int, argexprs::Vector
     return_value
 end

-const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")
+#const fatal_type_bound_error = ErrorException("fatal error in type inference (type bound)")

 function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
                                argexprs::Vector{Any}, linetable::Vector{LineInfoNode},
@@ -443,18 +443,20 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
         @assert !isa(metharg, UnionAll)
         cond = true
         @assert length(atype.parameters) == length(metharg.parameters)
-        for i in 1:length(atype.parameters)
-            a, m = atype.parameters[i], metharg.parameters[i]
-            # If this is always true, we don't need to check for it
-            a <: m && continue
-            # Generate isa check
-            isa_expr = Expr(:call, isa, argexprs[i], m)
-            ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
-            if cond === true
-                cond = ssa
-            else
-                and_expr = Expr(:call, and_int, cond, ssa)
-                cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+        if !item.fully_covered || next_cond_bb != generic_bb
+            for i in 1:length(atype.parameters)
+                a, m = atype.parameters[i], metharg.parameters[i]
+                # If this is always true, we don't need to check for it
+                a <: m && continue
+                # Generate isa check
+                isa_expr = Expr(:call, isa, argexprs[i], m)
+                ssa = insert_node_here!(compact, NewInstruction(isa_expr, Bool, line))
+                if cond === true
+                    cond = ssa
+                else
+                    and_expr = Expr(:call, and_int, cond, ssa)
+                    cond = insert_node_here!(compact, NewInstruction(and_expr, Bool, line))
+                end
             end
         end
         insert_node_here!(compact, NewInstruction(GotoIfNot(cond, next_cond_bb), Union{}, line))
@@ -495,8 +497,8 @@ function ir_inline_unionsplit!(compact::IncrementalCompact, idx::Int,
         bb += 1
         # We're now in the fall through block, decide what to do
         if item.fully_covered
-            e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
-            insert_node_here!(compact, NewInstruction(e, Union{}, line))
+            #e = Expr(:call, GlobalRef(Core, :throw), fatal_type_bound_error)
+            #insert_node_here!(compact, NewInstruction(e, Union{}, line))
             insert_node_here!(compact, NewInstruction(ReturnNode(), Union{}, line))
             finish_current_bb!(compact, 0)
         else
```
Bump.

Would be nice to get this in.

Is this just ready to merge?
@bkamins Does this help with any of the recent DataFrames performance work and benchmarks? Trying to figure out if it is a nice-to-eventually-have, or if it is affecting performance-sensitive things and should be addressed sooner.
This does not affect DataFrames.jl directly. However, it affects any data science workflow (and thus DataFrames.jl indirectly). What is holding up writing JuliaLang/www.julialang.org#1282 is indeed performance issues, but they are related to #40840. However, I fear that #40840 will not be resolved quickly (if at all; and even if it is resolved, the fact that GC in Julia is not multi-threading friendly will keep hitting us constantly in data science workflows). Therefore @quinnj is now rewriting CSV.jl to avoid GC issues as much as possible. After the new CSV.jl is out, I expect that in https://h2oai.github.io/db-benchmark/ the only thing we are currently bad at (i.e. joins on large data sets) will be much better. Essentially, in the 5GB join tests we should hopefully go down from 402 seconds to ~100 seconds (without changing anything in DataFrames.jl). Then we will be ready to make announcements. Of course there will still be room for improvements, but they will be marginal: in general, data.table and Polars are already very well tuned in terms of performance, so we should not expect to be able to be much better, but we essentially already have comparable performance; here https://opendatascience.com/saying-hello-to-dataframes-jl/ is a small comparison that is not GC-bound, as I am using only integers there.
Thanks @bkamins, very useful.

@bkamins I don't understand what CSV.jl has to do with the performance of DataFrames.jl in the joining benchmarks.
I will talk about it during my presentation at JuliaCon 2021 😄 (which is now accepted). The chain of consequences is:

The consequence is that in the H2O benchmarks of joins, garbage collection is around 75% of the total run time of joining (before some changes I made recently it was around 90%). Now, how to resolve it:

And the CSV.jl changes are exactly addressing the second point: @quinnj is developing changes that will provide two additional options of storing strings except for

In short the answer is: Is this clearer now?
@vtjnash, is that other fix (which I did not review) merged? Either way, can the PR here be merged? It seems simple and gives a good speedup, while being less general. Given the "forgetmenot" label, I would like to see either or both in Julia 1.7.
We are way past the point of getting this into 1.7. That said, we absolutely should try to get this merged for 1.8 if it's a good improvement.
One general solution is to force specialization:

```julia
function _simple_count(pred::F, itr, init::T) where {F,T}
```

You will likely need to do this on

To be clear, changes like these give you better runtime performance at the expense of worse latency. Often that's worth it, but just FYI.
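As a sketch of why the `pred::F where F` annotation matters (a minimal, illustrative example with hypothetical names, not the actual Base code):

```julia
# Julia may choose not to specialize a method on a function-typed argument
# that is only passed through, so `pred` can be called via a generic,
# non-inlined path inside the loop:
function count_unspecialized(pred, itr)
    n = 0
    for x in itr
        n += pred(x)::Bool
    end
    return n
end

# Giving the argument its own type parameter forces a separate method
# specialization per function passed in, letting the compiler inline
# `pred` into the loop body (and potentially vectorize it):
function count_specialized(pred::F, itr) where {F}
    n = 0
    for x in itr
        n += pred(x)::Bool
    end
    return n
end
```

Both return the same result; the difference is only in how the compiler specializes the method for each `pred`.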
Adding the triage label to this so we can get discussion as to whether we like this approach. IMO we should merge it.

Let's merge this for now, and open a new issue for the compiler fix.