Slow row aggregation in presence of missings #2757
We will not fix it in DataFrames.jl. I have opened an issue in Julia Base for this as the problem is with For now - just drop |
This is a good use for an |
Yes, c.f. #2440 |
Great. Glad to know you've thought it through. Hopefully we can add it without making compilation times worse. |
Great!
and if I use
Actually in some cases the ByRow() function can crash |
It is for sure not the fault of
but it is important to pinpoint the source of the errors and report back to Julia Base, so it would be great if you managed to create a reproducible example. |
For the first case, try:
|
Ah - the problem is even when you want to run |
Actually, there is a way (surprisingly general and extremely fast) to solve
df = DataFrame(rand(10,10^5),:auto)
op(x,y)= x .+= y
@btime mapreduce(identity, op, eachcol(df), init = zeros(nrow(df)))
1.681 ms (4 allocations: 240 bytes)
The
_op_bool_add(x::Bool,y::Bool) = x || y ? true : false
op(x,y) = x .= _op_bool_add.(x,ismissing.(y))
df = DataFrame(rand(10^5,10),:auto)
allowmissing!(df)
@btime completecases(df)
114.472 μs (56 allocations: 67.34 KiB)
@btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
42.686 μs (12 allocations: 114.52 KiB)
df = DataFrame(rand(10,10^5),:auto)
allowmissing!(df)
@btime completecases(df)
72.058 ms (400004 allocations: 6.10 MiB)
@btime .!mapreduce(identity, op, eachcol(df), init = zeros(Bool, nrow(df)))
2.381 ms (9 allocations: 368 bytes) |
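For anyone reading along, here is a minimal sketch of the same column-wise reduction idea applied to row sums that skip missings. It is self-contained and makes a few assumptions: plain vectors stand in for the columns you would get from `eachcol(df)`, `add_skipmissing!` is a hypothetical helper name, and it uses `foldl` instead of `mapreduce` so that the in-place accumulator is guaranteed to be passed along left to right.

```julia
# Plain vectors stand in for data frame columns (assumption for this sketch).
cols = [[1.0, missing, 3.0],
        [missing, 2.0, 1.0],
        [4.0, 5.0, missing]]

# Hypothetical helper: add one column into the running row sums in place,
# treating missing as zero. The broadcast is fused, so no temporary is allocated.
add_skipmissing!(acc, col) = (acc .+= coalesce.(col, 0.0); acc)

# One pass per column over a single accumulator of length nrow.
rowsums = foldl(add_skipmissing!, cols; init = zeros(length(first(cols))))
```

The point is that the work is done column by column over homogeneous vectors, rather than row by row over heterogeneous tuples.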
Yes - this will work and is indeed a nice solution (additionally, it is really easy to add multithreading to it using divide and conquer). The only limitation is that the operation you want to do must support reduction (which is often the case). |
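A rough sketch of what the divide-and-conquer variant could look like for the "does this row contain a missing" reduction. This is only an illustration under stated assumptions, not how DataFrames.jl implements anything: `chunk_has_missing` and `rows_with_missing` are hypothetical names, and plain vectors again stand in for the columns. Each task folds its own chunk of columns into a private accumulator, and the partial results are combined at the end, which is valid because `|` is associative.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: fold one chunk of columns into a fresh Bool accumulator
# (true means the row has at least one missing in this chunk).
function chunk_has_missing(chunk, nrows)
    acc = zeros(Bool, nrows)
    for col in chunk
        acc .|= ismissing.(col)
    end
    return acc
end

# Hypothetical driver: split the columns into chunks, reduce each chunk in its
# own task, then combine the partial results.
function rows_with_missing(cols::AbstractVector, nrows::Integer;
                           nchunks::Integer = nthreads())
    chunksize = max(1, cld(length(cols), nchunks))
    tasks = Task[]
    for chunk in Iterators.partition(cols, chunksize)
        t = @spawn chunk_has_missing(chunk, nrows)
        push!(tasks, t)
    end
    result = zeros(Bool, nrows)
    for t in tasks
        result .|= fetch(t)
    end
    return result
end

cols = [[1.0, missing, 3.0], [4.0, 5.0, 6.0], [missing, 8.0, 9.0]]
complete = .!rows_with_missing(cols, 3)
```

Whether this beats the single-threaded fold depends on the number of columns and rows, so it would need benchmarking on real data.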
Currently I am working on adapting this for sorting rows (is it faster than other methods?). The same would be possible for grouped data, i.e. within groups, and that's a very interesting operation for these sorts of implementations. |
Unfortunately, the insertion algorithm for sorting is killing the performance (for many columns or many rows within each group)! |
What are you trying to implement exactly? Maybe I can have a look (but in general it should be faster to pre-sort the data frame rather than sort it within-group) |
Like to find the most efficient way to |
You can check to first copy only what needs to be sorted and then |
Why not do |
because |
The timing: