Improve the performance of describe() in the case of missing values. #2731
Comments
See #2694 and the issues referenced there for a discussion. The key question is if you need |
Just one thing to add is that the situation for median should be similar to |
Now that I have slept on your issue, I understand that the core of what you want to highlight is:
And I agree that this is significant. However, this should be fixed in Statistics.jl and StatsBase.jl as this is a general performance hiccup - @nalimilan what do you think? |
I guess I can add a little to this: I think skipmissing(), mean(), etc. are supposed to be general functions that work with many different data structures in many different situations. However, in DataFrames we are dealing with rectangular data (and in practice, mostly with numbers and/or strings as eltype), so it might be easier(?!) to optimise these, especially mean, std, median, ..., which are very well defined in the context of data analysis. |
Just to add an example to demonstrate my point: in the following code there is nothing wrong with the count() function; however, in the context of counting the number of missing values, we don't need to use it.
x = rand(10^6)
x = allowmissing(x)
@btime count(ismissing,x);
  501.475 μs (0 allocations: 0 bytes)
@btime mapreduce(ismissing,+,x)
  124.164 μs (0 allocations: 0 bytes) |
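As a small sanity check on the comparison above, the sketch below confirms that the two calls being timed return the same count. The five-element vector is made up for illustration and is built with Base only (no allowmissing), and sum(ismissing, x) is shown as a third equivalent spelling; no timing claims are implied for any of them.

```julia
# Illustrative data (an assumption, not taken from the thread): two of five entries are missing.
x = Union{Float64, Missing}[0.3, missing, 1.2, missing, 0.7]

n_count  = count(ismissing, x)         # dedicated counting function
n_reduce = mapreduce(ismissing, +, x)  # map to Bool, then add the Bools up as Ints
n_sum    = sum(ismissing, x)           # same idea, written with sum

println((n_count, n_reduce, n_sum))
```

All three spellings answer the same question, so any speed difference between them is a property of the implementations in Base rather than of the task.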
This is a very good example, but I think it supports my claim if you investigate it further (I am re-running all to have timings on the same machine):
and as you can see the problem is with In general the design principle we want to stick to is to solve performance issues at their root. In this case Similarly Additionally - as shown above - as you can see - the problem with |
Also:
|
I agree things should and probably can be improved in Julia rather than in DataFrames. The only change that could be appropriate in DataFrames is to call |
Also it's worth noting that |
Yes, but it should happen in Statistics.jl. |
Maybe some of these should be customised for DataFrames, e.g. if we are dealing with one-dimensional numeric columns, then there are less general but faster ways to calculate std or q25 or ... in the presence of missing values. |
Yes, but this should happen in general, not just for DataFrames.jl. |
Just a side question: is there any situation in which someone working with a DataFrame wouldn't want to skip missings? (i.e. a scenario where it is desirable that f(x::Vector) returns missing when any, but not all, of x is missing) |
I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want
? |
yes, something like this. |
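For readers following along, here is a minimal sketch of the default behaviour under discussion. It is not the snippet from the comment above (which did not survive in this copy of the thread); the three-element vector is an assumption chosen for illustration.

```julia
using Statistics  # standard library; provides mean

# Illustrative data (an assumption): one missing value out of three.
x = Union{Float64, Missing}[1.0, missing, 3.0]

m_default = mean(x)               # missing propagates through the reduction
m_skipped = mean(skipmissing(x))  # missing values are dropped explicitly first

println(m_default)
println(m_skipped)
```

The first call returns missing because one input is unknown; the second makes the decision to ignore missing values explicit at the call site.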
I guess this is a commonly agreed standard for how functions in the statistics realm should work by default. Here is an example from an R session:
I guess the reason is to make sure that |
It may not be a commonly agreed standard, since I cannot recall any other statistical software with similar behaviour. The problem with this approach to handle the |
Please keep in mind that DataFrames.jl does not - and will not - define any statistical functions. It is a package for managing tabular data. I see your point, but simply the functions you ask for are not and will not be defined in DataFrames.jl. They should be defined in separate packages and then they can just be used in DataFrames.jl as any other functions (and if you are willing to implement such a package it would be very interesting to see the comparison). E.g. the current behavior of
This is something we will not do. DataFrames.jl does not display any messages when it does its operations. This package is designed for production use, where the assumption is that such messages would never be seen. Messages are passed by function return values or errors thrown (and returning |
I understand your points and I am OK with them, but I was thinking about something mild, like the fast path of aggregation in a grouped data frame, rather than dealing with (or modifying) the larger ecosystem of |
You seem to assume that operations on data frame columns can be faster than on general
I don't understand what this means. |
Sorry for the confusion, let me elaborate. In
function sim_sum(x::Vector{Union{T, Missing}}) where T
    all(ismissing, x) && return missing
    _dmiss(y) = ismissing(y) ? zero(T) : y
    mapreduce(_dmiss, Base.add_sum, x)
end
x = rand(10^6)
x = allowmissing(x)
x[rand(1:length(x), 1000)] .= missing
@btime sum(skipmissing($x))
  880.731 μs (5 allocations: 80 bytes)
499297.9721823642
@btime sim_sum($x)
  328.393 μs (0 allocations: 0 bytes)
499297.9721823642 |
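One detail worth checking before comparing the two approaches is whether they agree on edge cases. The sketch below restates sim_sum from the comment above so that it runs on its own, and uses small made-up vectors (an assumption, built with Base only) to compare it with sum(skipmissing(x)). Note that Base.add_sum is an internal Base function, so relying on it is itself an assumption about Julia internals.

```julia
# sim_sum as posted in the comment above, repeated here so the block is self-contained.
function sim_sum(x::Vector{Union{T, Missing}}) where T
    all(ismissing, x) && return missing
    _dmiss(y) = ismissing(y) ? zero(T) : y   # treat missing as zero
    mapreduce(_dmiss, Base.add_sum, x)
end

# Illustrative data: some values missing.
x = Union{Float64, Missing}[1.0, missing, 2.5, missing, 4.0]
s_sim  = sim_sum(x)
s_skip = sum(skipmissing(x))

# Illustrative data: every value missing.
allmiss = Union{Float64, Missing}[missing, missing]
e_sim  = sim_sum(allmiss)            # early return: missing
e_skip = sum(skipmissing(allmiss))   # empty reduction over Float64: 0.0

println((s_sim, s_skip, e_sim, e_skip))
```

On the partly missing vector both give the same sum, but on the all-missing vector they differ: sim_sum returns missing while sum(skipmissing(x)) returns zero. Any replacement for the skipmissing-based path would need to decide which of these results is wanted.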
Yes, but this means that we should improve |
Interesting. This doesn't happen with integers, only with floats. But this is more tricky than it seems due to the requirement to handle |
A special implementation is used because no equivalent operation exists in Base. That's not the case for things that |
Congratulations on release 1.0.0
Regarding the missing values in the describe() function, would it be possible to drop the skipmissing() function? The reason is purely performance. To demonstrate, let me run some benchmarks (I assume the extreme case where all variables have missing values; another situation would be a very large data set with a few variables with missing values)
however, if we used a customised mean(), something like
then we halve the running time:
Customising the std() function gives even more benefit (in the following code I used a tweaked version of std())