Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of describe() in the case of missing values. #2731

Open
sl-solution opened this issue Apr 21, 2021 · 25 comments
Open
Labels
Milestone

Comments

@sl-solution
Copy link

Congratulations on release 1.0.0

Regarding the missing values in the describe() function would it be possible to drop the skipmissing() function? The reason for this is purely performance wise. To demonstrate let me run some benchmarks (I assume the extreme case when all variables have missing values, other situation would be very large data set with few variables with missing values)

df = DataFrame(rand(10^5,100),:auto)
allowmissing!(df)
@btime describe(df,:mean);
  9.571 ms (1070 allocations: 87.75 KiB)

however if we used customised mean(), something like

Statistics.mean(f, ::AbstractArray{Missing,1}) = missing

function Statistics.mean(f, x::AbstractArray{Union{T,Missing},1}) where {T <: Number}
    _op(y1,y2) = (Base.add_sum(y1[1],y2[1]),Base.add_sum(y1[2],y2[2]))::Tuple{T,Int}
    _dmiss(y) = (ismissing(f(y)) ? zero(T) : f(y), !ismissing(f(y)))::Tuple{T,Bool}
    sval, n = mapreduce(_dmiss, _op, x)::Tuple{T, Int}
    n == 0 ? missing : sval/n
end

Statistics.mean(x::AbstractArray{Union{T,Missing},1}) where {T <: Number} = mean(identity, x)

then we half the running time:

@btime mapcols(mean, df);
  4.325 ms (232 allocations: 23.97 KiB)

Customising the std() function gives even more benefit, (in the following code I used a tweaked version of std())

@btime describe(df, :std);
  17.819 ms (1469 allocations: 94.02 KiB)
@btime mapcols(std, df);
  4.876 ms (232 allocations: 23.97 KiB)
@bkamins
Copy link
Member

bkamins commented Apr 21, 2021

See #2694 and the issued referenced there for a discussion.

The key question is if you need describe to be super fast? (I do not say it should not be, but it would be good to understand the use case before moving forward with the discussion)

@bkamins bkamins added this to the 1.x milestone Apr 21, 2021
@sl-solution
Copy link
Author

Just one thing to add is that the situation for median should be similar to std example, i.e. 4x faster, and in this case we can easily save seconds for medium to large data sets.

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

Now that I have slept over your issue I understand that the core of what you want to highlight is:

julia> x = rand(10^6);

julia> @btime mean($x);
  282.190 μs (0 allocations: 0 bytes)

julia> @btime mean(skipmissing($x));
  1.034 ms (0 allocations: 0 bytes)

And I agree that this is significant. However, this should be fixed in Statistics.jl and StatsBase.jl as this is a general performance hiccup - @nalimilan what do you think?

@sl-solution
Copy link
Author

I guess I can add a little to this: I think the skipmissing(), mean() ... are supposed to be general functions to work with many different data structures in many different situations, however, in DataFrames we are dealing with rectangle data (and in practice, mostly with numbers and/or string as eltype) and might be easier(?!) to optimise these, specially for mean, std, median,... which are very well defined in the context of data analysis.

@sl-solution
Copy link
Author

just to add an example to demonstrate my point let me go through a simple example. In the following code there is nothing wrong about the count() function, however, in the context of counting the number of missing values, we don't need to use it.

x = rand(10^6)
x = allowmissing(x)
@btime count(ismissing,x);
  501.475 μs (0 allocations: 0 bytes)
@btime mapreduce(ismissing,+,x)
 124.164 μs (0 allocations: 0 bytes)

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

This is a very good example, but I think it supports my claim if you investigate it further (I am re-running all to have timings on the same machine):

julia> @btime count(ismissing,$x);
  501.200 μs (0 allocations: 0 bytes)

julia> @btime mapreduce(ismissing,+,$x);
  87.300 μs (0 allocations: 0 bytes)

julia> @btime count(ismissing,$x,dims=1);
  84.600 μs (1 allocation: 96 bytes)

and as you can see the problem is with count implementation in Julia Base.

In general the design principle we want to stick to is to solve performance issues at their root. In this case ismissing, Number, and count are defined in Julia Base, so the performance problem should be solved in Julia Base.

Similarly mean is defined in Statistics.jl, so its performance should be solved there. If we started to do otherwise we would immediately hit the problems with bad consequences of type piracy.

Additionally - as shown above - as you can see - the problem with count is probably rather easy to fix in Julia Base.

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

Also:

julia> @btime count(Base.Generator(ismissing, x))
  71.000 μs (1 allocation: 16 bytes)

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

See JuliaLang/julia#40564

@nalimilan
Copy link
Member

I agree things should and probably can be improved in Julia rather than in DataFrames. The only change that could be appropriate in DataFrames is to call collect(skipmissing(x)) or view(x, .!ismissing.(x)) internally if paying this cost once is faster given that we call multiple functions (this is something that Base functions cannot do since they don't know this). But AFAICT that's not really the case (#2694).

@nalimilan
Copy link
Member

Also it's worth noting that mean is relatively complex as it ensures that accumulation is performed in the output type (https://github.com/JuliaLang/Statistics.jl/pull/25). Maybe that hurts performance with skipmissing and it could be simplified for simple cases like Float64.

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

Yes, but it should happen in Statistics.jl.

@sl-solution
Copy link
Author

Maybe some of these should be customised for DataFrames, e.g. if we are dealing with one dimensional numeric columns then there are less general but faster way to calculate std or q25 or ... in presence of missing values.

@bkamins
Copy link
Member

bkamins commented Apr 22, 2021

Maybe some of these should be customised for DataFrames, e.g. if we are dealing with one dimensional numeric columns then there are less general but faster way to calculate std or q25 or ... in presence of missing values.

Yes, but this should happen in general, not just for DataFrames.jl.

@sl-solution
Copy link
Author

Just a side question, is there any situation that some one working with DataFrame wouldn't like to deal with missings? (i.e. a scenario that f(x::Vector) return missing when any but not all of x is missing is desirable)

@bkamins
Copy link
Member

bkamins commented Apr 23, 2021

I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want missing to propagate like e.g. in:

julia> sum([1, missing])
missing

?

@sl-solution
Copy link
Author

I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want missing to propagate like e.g. in:

julia> sum([1, missing])
missing

?

yes, something like this.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2021

I guess this is a commonly agreed standard how functions in statistics realm should work by default. Here is an example from R session:

> sum(c(1, NA))
[1] NA
> mean(c(1, NA))
[1] NA

I guess the reason is to make sure that missing do not get silently ignored by the analyst, but rather consciously handled.

@sl-solution
Copy link
Author

I guess this is a commonly agreed standard how functions in statistics realm should work by default. Here is an example from R session:

> sum(c(1, NA))
[1] NA
> mean(c(1, NA))
[1] NA

I guess the reason is to make sure that missing do not get silently ignored by the analyst, but rather consciously handled.

It may not be commonly agreed standard, since I can not recall any other statistical software with similar behaviour. The problem with this approach to handle the missings is that we bother the analyst every time to make sure missings are not ignored. But a simple way to remind analyst this is to display the variables with missings differently which already DataFrames does by showing different column type.
My point with this is that skipmissing might not be only approach for DataFrames. Since there is no such a concept as "skipping missings". skipmissing is basically deleting observation with missing, this work for some functions like sum(although not efficiently currently) however, if someone likes to subtract each observation from the mean value, skip works only partially. DataFrames approach to missing values can be handling every thing behind the scene and just keep user informed(e.g. by displaying some notes about the producing missing during some tasks ).

@bkamins
Copy link
Member

bkamins commented Apr 24, 2021

Please keep in mind that DataFrames.jl does not - and will not - define any statistical functions. It is a package for managing tabular data. I see your point, but simply the functions you ask for are not and will not be defined in DataFrames.jl. They should be defined in separate packages and then they can just be used in DataFrames.jl as any other functions (and if you are willing to implement such a package it would be very interesting to see the comparison). E.g. the current behavior of sum is defined in Julia Base and it will not change till 2.0 release of Julia (and I assume it is unlikely that such change would be accepted by core Julia developers), so you need a separate package with a separate function definition that would do what you want (the good thing about Julia is that it should be relatively easy to do and it will be fast).

e.g. by displaying some notes about the producing missing during some tasks

This is something we will not do. DataFrames.jl does not display any messages when it does its operations. This package is designed for production use where the assumption is that such messages would be never seen. Messages are passed by function return values or errors thrown (and returning missing is exactly one of such messages).

@sl-solution
Copy link
Author

sl-solution commented Apr 24, 2021

I understand your points and I am ok with them, but I was thinking about something mild like fast path of aggregation in grouped data frame rather than dealing (or modifying) the larger echo system of Julia.

@nalimilan
Copy link
Member

You seem to assume that operations on data frame columns can be faster than on general AbstractArray or AbstractVector objects, but that's definitely not the case. And sum(skipmissing(x)) is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (mean is slower but I already developed this above).

I understand your points and I ok with them, but I was thinking about something like fast path of aggregation in grouped data frame rather than dealing (or modifying) the larger echo system of Julia.

I don't understand what this means.

@sl-solution
Copy link
Author

I don't understand what this means.

Sorry for confusion, let me elaborate this. In DataFrames when an aggregation(like sum, mean, ...) is done for a GroupedDataFrame a customised approach is used to speed things up.

... And sum(skipmissing(x)) is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (mean is slower but I already developed this above).

function sim_sum(x::Vector{Union{T, Missing}}) where T
           all(ismissing, x) && return missing
           _dmiss(y) = ismissing(y) ?  zero(T) :  y
           mapreduce(_dmiss, Base.add_sum, x)
  end
x=rand(10^6)
x=allowmissing(x)
x[rand(1:length(x),1000)].=missing
@btime sum(skipmissing($x))
  880.731 μs (5 allocations: 80 bytes)
499297.9721823642
@btime sim_sum(($x))
  328.393 μs (0 allocations: 0 bytes)
499297.9721823642

@bkamins
Copy link
Member

bkamins commented Apr 24, 2021

Yes, but this means that we should improve sum(skipmissing(x)) in Julia base. Thank you for pointing this out.

@nalimilan
Copy link
Member

Interesting. This doesn't happen with integers, only with floats. But this is more tricky than it seems due to the requirement to handle -0.0. See JuliaLang/julia#40599.

@nalimilan
Copy link
Member

Sorry for confusion, let me elaborate this. In DataFrames when an aggregation(like sum, mean, ...) is done for a GroupedDataFrame a customised approach is used to speed things up.

A special implementation is used because no equivalent operation exists in Base. That's not the case for things that describe does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants