Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow row aggregation in presence of missings #2757

Closed
sl-solution opened this issue May 10, 2021 · 17 comments
Closed

Slow row aggregation in presence of missings #2757

sl-solution opened this issue May 10, 2021 · 17 comments
Labels
ecosystem Issues in DataFrames.jl ecosystem performance

Comments

@sl-solution
Copy link

The timing:

df = DataFrame(rand(10^5, 100),:auto);

@time combine(df, AsTable(:) => ByRow(sum))
  0.047518 seconds (357 allocations: 809.000 KiB)

allowmissing!(df)
@time combine(df, AsTable(:) => ByRow(sum))
  5.096357 seconds (19.80 M allocations: 15.195 GiB, 45.20% gc time)
@bkamins bkamins added this to the patch milestone May 10, 2021
@bkamins bkamins added grouping performance ecosystem Issues in DataFrames.jl ecosystem and removed grouping labels May 10, 2021
@bkamins bkamins removed this from the patch milestone May 10, 2021
@bkamins
Copy link
Member

bkamins commented May 10, 2021

We will not fix it in DataFrames.jl I have opened an issue in Julia Base for this as the problem is with sum implementation. I have opened JuliaLang/julia#40768 for this. We cannot do anything about it in DataFrames.jl.

For now - just drop ByRow to fix the problem.

@bkamins bkamins closed this as completed May 10, 2021
@pdeffebach
Copy link
Contributor

This is a good use for an AsVector wrapper, right? We shouldn't be making tuples this big anyways.

@bkamins
Copy link
Member

bkamins commented May 10, 2021

Yes, c.f. #2440

@pdeffebach
Copy link
Contributor

Great. Glad to know you've thought through it. Hopefully we can added it without making compilation times worse.

@sl-solution
Copy link
Author

Great!
Just a heads up that somehow (I guess when table is wide plus some other conditions which I am not aware yet) I managed to crash combine(df, AsTable(:)=>sum). The first few lines of the output are:

Internal error: encountered unexpected error in runtime:
MethodError(f=Core.Compiler.widenconst, args=(:x34466,), world=0x00000000000010a8)
jl_method_error_bare at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
jl_method_error at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
jl_apply_generic at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
getfield_elim_pass! at ./compiler/ssair/passes.jl:622
run_passes at ./compiler/ssair/driver.jl:133

and if I use ByRow(sum) the error will be:

ERROR: MethodError: no method matching widenconst(::Symbol)
Closest candidates are:
  widenconst(::Core.Compiler.Conditional) at compiler/typelattice.jl:228
  widenconst(::Core.Const) at compiler/typelattice.jl:229
  widenconst(::Core.Compiler.MaybeUndef) at compiler/typelattice.jl:239
....

Actually in some cases the ByRow() function can crash Julia itself!

@bkamins
Copy link
Member

bkamins commented May 10, 2021

It is for sure not the fault of ByRow as it is super simple:

(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

but it is important to pinpoint the source of the errors and report back to Julia Base, so it would be great if you managed to create a reproducible example.

@sl-solution
Copy link
Author

For the first case, try:

using BenchmarkTools, DataFrames
df = DataFrame(rand(10,10^5),:auto)
@btime combine(df, AsTable(:)=> sum)

@bkamins
Copy link
Member

bkamins commented May 10, 2021

Ah - the problem is even when you want to run Tables.columntable(df). For such cases eachrow should be used (it does not mean that we should not try fixing the issue in Julia Base).

@sl-solution
Copy link
Author

sl-solution commented May 11, 2021

Actually, there is a way (surprising general and extremely fast) to solve eachrow slowness. Maybe using(adapting) it can solve a large class of problems.

df = DataFrame(rand(10,10^5),:auto)

op(x,y)= x .+= y
@btime mapreduce(identity, op, eachcol(df), init = zeros(nrow(df)))
1.681 ms (4 allocations: 240 bytes)

The op operator can be more complicated than this. Look at the example (run on Julia 1.6.1)

_op_bool_add(x::Bool,y::Bool) = x || y ? true : false
op(x,y) = x .= _op_bool_add.(x,ismissing.(y))

df = DataFrame(rand(10^5,10),:auto)
allowmissing!(df)
@btime completecases(df)
  114.472 μs (56 allocations: 67.34 KiB)
@btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
  42.686 μs (12 allocations: 114.52 KiB)


df = DataFrame(rand(10,10^5),:auto)
allowmissing!(df)
@btime completecases(df)
  72.058 ms (400004 allocations: 6.10 MiB)
 @btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
  2.381 ms (9 allocations: 368 bytes)

@bkamins
Copy link
Member

bkamins commented May 11, 2021

Yes - this will work and is indeed a nice solution (additionally it is really easy to use multithreading in it using divide and conquer). The only limitation is that the operation you want to do must support reduction (which is often a case).

@sl-solution
Copy link
Author

sl-solution commented May 11, 2021

Currently I am working (is it faster than other methods??) to adapt this for sorting rows (and the same would be possible for grouped data, i.e. within groups), and that's very interesting operation for these sorts of implementations.

@sl-solution
Copy link
Author

Currently I am working (is it faster than other methods??) to adapt this for sorting rows (and the same would be possible for grouped data, i.e. within groups), and that's very interesting operation for these sorts of implementations.

Unfortunately, the insertion algorithm for sorting killing the performance (for many cols or may rows within each group)!

@bkamins
Copy link
Member

bkamins commented May 17, 2021

What are you trying to implement exactly? Maybe I can have a look (but in general it should be faster to pre-sort the data frame rather than sort it within-group)

@sl-solution
Copy link
Author

What are you trying to implement exactly? Maybe I can have a look (but in general it should be faster to pre-sort the data frame rather than sort it within-group)

Like to find the most efficient way to combine(groupby(df, gcols), cols => sort) or select(df, AsTable(cols) => ByRow(sort)). For example in grouped data, using gdf.groups should be good, however, the only suitable algorithm (i can think) is insertion sort which is very slow.

@bkamins
Copy link
Member

bkamins commented May 18, 2021

You can check to first copy only what needs to be sorted and then sort! it.

@pdeffebach
Copy link
Contributor

Why not do sort(df, vcat(groupcols, cols))?

@bkamins
Copy link
Member

bkamins commented May 18, 2021

because sort(df, vcat(groupcols, cols)) retains all columns of the data frame, while your expression above only keeps gcols and cols.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecosystem Issues in DataFrames.jl ecosystem performance
Projects
None yet
Development

No branches or pull requests

3 participants