Slow row aggregation in presence of missings #2757

sl-solution · 2021-05-10T04:08:42Z

The timing:

using DataFrames

df = DataFrame(rand(10^5, 100), :auto);

@time combine(df, AsTable(:) => ByRow(sum))
  0.047518 seconds (357 allocations: 809.000 KiB)

allowmissing!(df)
@time combine(df, AsTable(:) => ByRow(sum))
  5.096357 seconds (19.80 M allocations: 15.195 GiB, 45.20% gc time)

bkamins · 2021-05-10T08:45:37Z

We cannot fix this in DataFrames.jl, as the problem is with the sum implementation in Julia Base. I have opened JuliaLang/julia#40768 for this.

For now - just drop ByRow to fix the problem.
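
To illustrate why dropping ByRow gives the same row sums, here is a minimal Base-only sketch. The NamedTuple `cols` is a hypothetical stand-in for what AsTable(:) passes to the function; no DataFrames.jl call is made here. Without ByRow, sum adds whole column vectors; with ByRow, it is called once per row on a small tuple of values.

```julia
# Hypothetical stand-in for the NamedTuple of columns that AsTable(:) produces.
cols = (x1 = [1.0, 2.0, 3.0], x2 = [10.0, missing, 30.0])

# Without ByRow: sum iterates over the columns and adds them as whole vectors.
rowsums_by_column = sum(cols)

# With ByRow (sketched by hand): sum is called once per row.
rowsums_by_row = [sum(row) for row in zip(cols...)]

# Both give the same row sums; missing propagates in both cases.
println(isequal(rowsums_by_column, rowsums_by_row))
```

The column-wise form makes one pass per column over contiguous vectors, which is presumably why it avoids the slowdown reported above.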

pdeffebach · 2021-05-10T15:28:36Z

This is a good use for an AsVector wrapper, right? We shouldn't be making tuples this big anyways.

bkamins · 2021-05-10T15:31:25Z

Yes, cf. #2440

pdeffebach · 2021-05-10T15:47:25Z

Great. Glad to know you've thought through it. Hopefully we can add it without making compilation times worse.

sl-solution · 2021-05-10T21:55:41Z

Great!
Just a heads up that somehow (I guess when the table is wide, plus some other conditions that I am not aware of yet) I managed to crash combine(df, AsTable(:)=>sum). The first few lines of the output are:

Internal error: encountered unexpected error in runtime:
MethodError(f=Core.Compiler.widenconst, args=(:x34466,), world=0x00000000000010a8)
jl_method_error_bare at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
jl_method_error at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
jl_apply_generic at /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.dylib (unknown line)
getfield_elim_pass! at ./compiler/ssair/passes.jl:622
run_passes at ./compiler/ssair/driver.jl:133

and if I use ByRow(sum) the error will be:

ERROR: MethodError: no method matching widenconst(::Symbol)
Closest candidates are:
  widenconst(::Core.Compiler.Conditional) at compiler/typelattice.jl:228
  widenconst(::Core.Const) at compiler/typelattice.jl:229
  widenconst(::Core.Compiler.MaybeUndef) at compiler/typelattice.jl:239
....

Actually in some cases the ByRow() function can crash Julia itself!

bkamins · 2021-05-10T21:58:54Z

It is for sure not the fault of ByRow as it is super simple:

(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

but it is important to pinpoint the source of the errors and report back to Julia Base, so it would be great if you managed to create a reproducible example.

sl-solution · 2021-05-10T22:05:21Z

For the first case, try:

using BenchmarkTools, DataFrames
df = DataFrame(rand(10,10^5),:auto)
@btime combine(df, AsTable(:)=> sum)

bkamins · 2021-05-10T22:28:29Z

Ah - the problem occurs even when you just run Tables.columntable(df). For such cases eachrow should be used (it does not mean that we should not try fixing the issue in Julia Base).

sl-solution · 2021-05-11T00:04:13Z

Actually, there is a way (surprisingly general and extremely fast) to solve the eachrow slowness. Maybe using (adapting) it can solve a large class of problems.

df = DataFrame(rand(10,10^5),:auto)

op(x,y)= x .+= y
@btime mapreduce(identity, op, eachcol(df), init = zeros(nrow(df)))
1.681 ms (4 allocations: 240 bytes)
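
As a small sanity check of the idea, here is a sketch with plain vectors standing in for eachcol(df) (the data and the names `cols` and `rowsums` are made up for illustration). The accumulator passed as init is updated in place, so only one result vector is allocated no matter how many columns there are.

```julia
# Plain vectors stand in for the columns returned by eachcol(df).
cols = [fill(Float64(j), 4) for j in 1:3]   # 3 columns, 4 rows

# In-place accumulate: adds column y into the accumulator x and returns x.
op(x, y) = x .+= y

# The init vector is the accumulator; the columns themselves are not modified.
rowsums = mapreduce(identity, op, cols, init = zeros(4))

println(rowsums)
```

Passing init matters here: without it, op would accumulate into one of the columns themselves. Since op mutates its first argument, foldl(op, cols; init = zeros(4)) may be the safer spelling, as it makes the left-to-right evaluation order explicit.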

The op operator can be more complicated than this. Look at this example (run on Julia 1.6.1):

# Logical OR of two Bools
_op_bool_add(x::Bool, y::Bool) = x || y
# Accumulate in place: x[i] becomes true once any column has a missing in row i
op(x, y) = x .= _op_bool_add.(x, ismissing.(y))

df = DataFrame(rand(10^5,10),:auto)
allowmissing!(df)
@btime completecases(df)
  114.472 μs (56 allocations: 67.34 KiB)
@btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
  42.686 μs (12 allocations: 114.52 KiB)


df = DataFrame(rand(10,10^5),:auto)
allowmissing!(df)
@btime completecases(df)
  72.058 ms (400004 allocations: 6.10 MiB)
@btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
  2.381 ms (9 allocations: 368 bytes)
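
To check that the reduction really marks the complete rows, here is a tiny Base-only sketch. Plain vectors stand in for eachcol(df), and `mark_missing!` is a hypothetical name for the op above; the result is compared against a naive row-by-row check rather than against completecases itself.

```julia
# Plain vectors stand in for the columns of a data frame with missings.
cols = [[1.0, missing, 3.0], [missing, 2.0, 3.0], [1.0, 2.0, 3.0]]

# Hypothetical name for the op above: flag rows where column y is missing.
mark_missing!(x, y) = x .= x .| ismissing.(y)

# Fold over the columns, starting from an all-false accumulator.
hasmissing = foldl(mark_missing!, cols; init = zeros(Bool, 3))
complete = .!hasmissing

# Naive reference: a row is complete if none of its values is missing.
reference = [all(!ismissing, row) for row in zip(cols...)]

println(complete == reference)
```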

bkamins · 2021-05-11T06:04:45Z

Yes - this will work and is indeed a nice solution (additionally, it is really easy to use multithreading in it using divide and conquer). The only limitation is that the operation you want to do must support reduction (which is often the case).
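
A rough sketch of what such a divide-and-conquer version could look like, using only Base. The function name `threaded_colsum` and the `basesize` cutoff are assumptions made for illustration, not anything from DataFrames.jl, and the list of columns is assumed to be non-empty.

```julia
using Base.Threads: @spawn

# Hypothetical helper: sum columns lo:hi by splitting the range in two,
# reducing the left half in a separate task and the right half in this one.
function threaded_colsum(cols::AbstractVector{<:AbstractVector{Float64}},
                         lo::Int = firstindex(cols), hi::Int = lastindex(cols);
                         basesize::Int = 64)
    if hi - lo + 1 <= basesize
        # Small range: plain sequential in-place accumulation.
        acc = zeros(length(cols[lo]))
        for j in lo:hi
            acc .+= cols[j]
        end
        return acc
    end
    mid = (lo + hi) >>> 1
    left = @spawn threaded_colsum(cols, lo, mid; basesize = basesize)
    right = threaded_colsum(cols, mid + 1, hi; basesize = basesize)
    acc = fetch(left)
    acc .+= right   # combine the two partial results
    return acc
end

# 1000 columns of 5 rows each; column j is filled with the value j.
cols = [fill(Float64(j), 5) for j in 1:1000]
total = threaded_colsum(cols)
println(total)
```

Each task works on its own accumulator, so no locking is needed; the price is one extra vector per leaf of the recursion. Whether this actually beats the sequential version depends on the number of threads Julia was started with and on the data size.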

sl-solution · 2021-05-11T06:22:57Z

Currently I am working on adapting this for sorting rows (is it faster than other methods??); the same would be possible for grouped data, i.e. within groups. That's a very interesting operation for these sorts of implementations.

sl-solution · 2021-05-17T09:57:10Z

> Currently I am working on adapting this for sorting rows (is it faster than other methods??); the same would be possible for grouped data, i.e. within groups. That's a very interesting operation for these sorts of implementations.

Unfortunately, the insertion sort algorithm is killing the performance (for many cols or many rows within each group)!

bkamins · 2021-05-17T11:09:57Z

What are you trying to implement exactly? Maybe I can have a look (but in general it should be faster to pre-sort the data frame rather than sort it within-group)

sl-solution · 2021-05-18T07:27:27Z

> What are you trying to implement exactly? Maybe I can have a look (but in general it should be faster to pre-sort the data frame rather than sort it within-group)

I would like to find the most efficient way to do combine(groupby(df, gcols), cols => sort) or select(df, AsTable(cols) => ByRow(sort)). For example, for grouped data, using gdf.groups should be good; however, the only suitable algorithm I can think of is insertion sort, which is very slow.

bkamins · 2021-05-18T14:15:36Z

You can try first copying only what needs to be sorted and then calling sort! on it.
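
One possible reading of this suggestion, sketched with plain vectors (the names `vals`, `groups` and `out` are made up, and `groups` only loosely mimics the role of gdf.groups): for each group, copy out just that group's values, sort the copy in place with the standard sort!, and write it back.

```julia
vals   = [3, 9, 1, 7, 2, 8]
groups = [1, 2, 1, 2, 1, 2]   # hypothetical group id for each row

out = copy(vals)
for g in unique(groups)
    idx = findall(==(g), groups)   # rows belonging to group g
    chunk = vals[idx]              # indexing copies only what needs sorting
    sort!(chunk)                   # default algorithm, not a hand-written insertion sort
    out[idx] = chunk               # write the sorted values back in place
end

println(out)
```

The findall call per group is only there to keep the sketch short; it rescans all rows for every group, so a real implementation would want precomputed group indices.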

pdeffebach · 2021-05-18T14:23:56Z

Why not do sort(df, vcat(groupcols, cols))?

bkamins · 2021-05-18T16:59:25Z

Because sort(df, vcat(groupcols, cols)) retains all columns of the data frame, while your expression above only keeps gcols and cols.

bkamins added this to the patch milestone May 10, 2021

bkamins added grouping, performance, ecosystem and removed grouping labels May 10, 2021

bkamins removed this from the patch milestone May 10, 2021

bkamins closed this as completed May 10, 2021

sl-solution mentioned this issue May 19, 2021

Fast row aggregation in DataFrames.jl #2768

Closed
