Fast row aggregation in DataFrames.jl #2768
This makes a lot of sense. However, if you allow it 😄 (given you have already put a lot of work into it) in DataFrames.jl, I would do the following (this is easy to do, and is more extensible). I will give an example for only one function. When a user writes:
or (if handling missing values is required)
then we can intercept
so we can dispatch on it and provide an efficient aggregation path for both. The only consideration is that, if we want to be consistent, we should probably first introduce
The benefit of this approach, over your proposal, is that we do not introduce numerous new verbs: we use what we currently have and only need to add one thing. What do you think about it? |
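To make the dispatch idea concrete, here is a minimal sketch (my own illustration, not code from this thread; the helper name and the choice of sum are assumptions) of how a recognized columns-plus-reduction request could be routed to a columnwise kernel, while anything unrecognized keeps the generic row-wise path:

```julia
using DataFrames

# generic fallback: build a NamedTuple per row and apply f to it
fast_row_agg(df, cols, f) =
    combine(df, AsTable(cols) => ByRow(f) => :out).out

# fast path selected by dispatch on typeof(sum): accumulate column by column
function fast_row_agg(df, cols, ::typeof(sum))
    acc = zeros(Float64, nrow(df))
    for col in eachcol(select(df, cols, copycols = false))
        acc .+= coalesce.(col, 0)          # missings contribute zero to the sum
    end
    return acc
end

df = DataFrame(a = [1, missing, 3], b = [4.0, 5.0, 6.0])
fast_row_agg(df, [:a, :b], sum)        # fast path        -> [5.0, 5.0, 9.0]
fast_row_agg(df, [:a, :b], maximum)    # generic fallback -> [4.0, missing, 6.0]
```

The user-facing pair syntax would not change under such a scheme; only the internal routing would.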
Handling
Adding more verbs, i.e. |
Could you please expand on your comment about where you see the conflict? What I mean is to use |
I was thinking about it a bit more. Actually it might make sense for your package to stand on its own. Essentially it could drop the dependency on DataFrames.jl and support any Tables.jl compliant source. Then it would be more general. I feel that some people might then find it useful for workflows that do not use DataFrames.jl. |
In summary I would propose the following plan:
One particular comment: note that, e.g.,
will already be fast without any special handling, as |
I was thinking about this; however, it creates one issue which I don't know how to avoid: using |
I think that would not be that simple (?), e.g. |
Let us see what other maintainers say, but my feeling is that we will not allow adding that many
Yes - this would not be able to use the fast path, and this would be a limitation of this approach. This is the same limitation we have in fast aggregation. However, if in the future Julia Base supported nicer currying, it would be very easy to add:
just right now
Yes, it would not be optimal, but it would already be relatively good. As commented - we could always add a special path to it. |
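As an aside on why currying matters for such a fast path (my illustration, not from the thread): an anonymous function has an opaque type that nothing can dispatch on, while composed or partially applied callables carry their ingredients in their type:

```julia
# anonymous function: its type says nothing about what it does,
# so no specialized method can be selected for it
f1 = x -> sum(abs, x)

# composition keeps the pieces in the type, e.g. a method could dispatch on
# ComposedFunction{typeof(sum), typeof(skipmissing)}
f2 = sum ∘ skipmissing
typeof(f2)                   # ComposedFunction{typeof(sum), typeof(skipmissing)}

# Base.Fix1 is a limited built-in form of partial application that is
# similarly transparent to dispatch
f3 = Base.Fix1(sum, abs)     # behaves like x -> sum(abs, x)
f3([1, -2, 3])               # -> 6
```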
Maybe let me add another general comment. In DataFrames.jl we try to have a different design approach than e.g. Pandas has. We want to keep the body of the package as minimal as possible, while making sure that the core is flexible enough to cover users' needs. The assumption is that satellite packages (or even some simple Julia Base code) can cover the rest. The idea is that we do not need to bake all the functionality into DataFrames.jl for the sake of speed (as opposed to e.g. Pandas, where you have to call optimized C code to be fast, so everything has to be included). This is a consideration I have in mind when discussing this issue. This is a different situation than e.g. your #2743 proposal - in #2743 we know that the change has to happen inside DataFrames.jl itself. In summary, both this proposal and #2743 hit important things we want to focus on adding post 1.0 release. However, both issues - for different reasons - are not simple decisions to make and quickly merge new functionality for. Let me also give you the following perspective: the proposals by @pstorozenko (a new contributor) in #2726 and #2727 were easy decisions to turn into PRs and merge, as they were both clear improvements that do not affect the user-facing API. Everything that affects the exposed API will take much longer to process. We are after the 1.0 release (which we had to make although we knew there were still things to work on). Being after the 1.0 release means that we cannot loosely experiment with the user-facing API any more. Whatever we add must be very well thought out so that we are 100% sure it will not change in the future. |
Discussing this issue will eventually be helpful and we can end up with a solid solution.
I guess this wouldn't be an issue, since a suitable design can solve it (?) (although we are used to this kind of verb in
I think |
I fully agree with @bkamins. We should keep the API both minimal and flexible. The strength of DataFrames.jl is that it leverages the fact that Julia allows writing custom code that is fast, so that we don't need to provide dozens of special methods. When particular optimizations are needed, they should happen under the hood, while users still use the more general syntax. |
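For concreteness, the generic user-facing spelling in question (documented DataFrames.jl syntax) would stay the same whether or not an optimized path kicks in underneath:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# conceptually a NamedTuple per row passed to sum; an internal fast path could
# recognize this pattern and compute the same result column by column
combine(df, AsTable([:a, :b]) => ByRow(sum) => :row_total)
```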
I explored the idea a little more and defined a function called byrow. For example:
julia> df = DataFrame(g = [1, 1, 1, 2, 2],
x1_int = [0, 0, 1, missing, 2],
x2_int = [3, 2, 1, 3, -2],
x1_float = [1.2, missing, -1.0, 2.3, 10],
x2_float = [missing, missing, 3.0, missing, missing],
x3_float = [missing, missing, -1.4, 3.0, -100.0])
5×6 DataFrame
Row │ g x1_int x2_int x1_float x2_float x3_float
│ Int64 Int64? Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ 1 0 3 1.2 missing missing
2 │ 1 0 2 missing missing missing
3 │ 1 1 1 -1.0 3.0 -1.4
4 │ 2 missing 3 2.3 missing 3.0
5 │ 2 2 -2 10.0 missing -100.0
julia> byrow(sum, df, r"x")
5-element Vector{Union{Missing, Float64}}:
4.2
2.0
2.6
8.3
-90.0
julia> byrow(sum, df, r"x", by = abs)
5-element Vector{Union{Missing, Float64}}:
4.2
2.0
7.4
8.3
114.0
This can fit into the current design with a small change. For example in select:
julia> select(df, :, r"x" => byrow(sum, by = abs) => :total)
5×7 DataFrame
Row │ g x1_int x2_int x1_float x2_float x3_float total
│ Int64 Int64? Int64 Float64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────────
1 │ 1 0 3 1.2 missing missing 4.2
2 │ 1 0 2 missing missing missing 2.0
3 │ 1 1 1 -1.0 3.0 -1.4 7.4
4 │ 2 missing 3 2.3 missing 3.0 8.3
5 │ 2 2 -2 10.0 missing -100.0 114.0 |
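For comparison, a fully generic spelling of the same operation with the existing minilanguage (my sketch; correct, but with no dedicated fast path) would be something like:

```julia
# per-row anonymous function over the NamedTuple of each row; missings are
# dropped with skipmissing, and abs plays the role of the by keyword above
select(df, :, AsTable(r"x") => ByRow(r -> sum(abs, skipmissing(r))) => :total)
```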
Rewriting it a bit, we could - instead of adding
and
which would efficiently call it. This would be a bit less flexible, though. So we have three designs on the table:
Let us wait for other users/contributors to comment on what they think. |
I updated
|
I've been following this conversation and I have the following thoughts. The main problem is performance. In other words, the problem is not that we don't have a good way to
I think that is reasonable. As for DataFramesMeta, we still need to implement |
If we go for this option, the implementation I proposed some time ago would be relevant. In summary, the tension is:
What we could do is define it separately. However, let me think a bit more about it. For now I think that |
Just one thing to think about is that AsVector and AsTable will be inefficient both for many columns with few rows and for many rows with few columns, e.g. (I used sum for simplicity - and note that I haven't used skipmissing for AsVector and AsTable):
df = DataFrame(randn(10^7,10), :auto)
myfun(x) = x[1]<0 ? x = round.(Int,x) : x
mapcols!(myfun, df)
allowmissing!(df, r"1")
df[1,1]=missing
@time byrow(sum, df)
0.204496 seconds (263 allocations: 85.845 MiB)
fsum(df) = [sum(x) for x in VectorIterator(df)]
@time fsum(df)
6.876729 seconds (259.99 M allocations: 3.958 GiB, 8.20% gc time)
@time combine(df , AsTable(:)=>ByRow(sum))
9.570441 seconds (250.00 M allocations: 11.856 GiB, 20.97% gc time) |
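For intuition about where the gap in these timings comes from (my sketch, not code from the thread): the generic path materializes a per-row container for each of the 10^7 rows, while a columnwise kernel touches every column exactly once.

```julia
using Tables

# roughly what the generic AsTable/ByRow path boils down to: one NamedTuple per
# row, so for 10^7 rows there are tens of millions of small intermediate objects
rowwise_sum(df) = [sum(r) for r in Tables.namedtupleiterator(df)]
```

A columnwise kernel, as in the earlier sketch, makes one broadcasted pass per column with a single output allocation, which is why byrow lands close to matrix-like timings.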
The implementation of Secondly - |
I agree. the code for Skipping |
|
I think one other note, @sl-solution, is that if we go with that design, it seems unlikely DataFrames would implement all of your really finely-tuned functions, but it would help a lot if it were super easy to use them inside a standard transformation call. Something like
|
😃 Indeed I checked, but it is about the same timing as what I reported in the last comment.
One question: is
As for row operations: yeah, similarly, a
I just want to clarify this one point,
|
Just to add to what @pdeffebach commented: the API design in DataFrames.jl is driven by the following considerations:
The point is that DataFrames.jl was never intended to be a super fast package - its main point was convenience. If someone wants super fast operations then probably other data structures are better suited than a DataFrame. Having said that, it does not mean that we do not want to have good speed - we do. It is just that the mental model of development is that we first need to ensure that what we provide is consistent with the whole JuliaData ecosystem, and only then think about how to make it fast. With row aggregations we are currently in a state of "embarrassingly slow execution" with the API we have now, so clearly this should be fixed, but we need to do it in a way that is consistent with the ecosystem and easy to maintain in the long run. That is why we are hesitating to make a decision. |
But, thankfully, it is not a Vector of Vectors 😃 One thing to think about is
I totally understand it and it makes sense. The current implementation of |
Sorry, just to clarify, I mean a
I agree regarding the API. I see the following way forward, and I'm wondering if you can agree:
(I'm sure there are other Tables.jl functions that would make this more composable.) But despite allocating a vector in the outer function, the timing of this function is comparable to the optimized one. I say "optionally" because we already have a syntax for this. EDIT: There are some complicated concerns here about typed vs untyped tables and when a |
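On the Tables.jl composability point, here is a sketch of what a DataFrames-independent row aggregation could look like (the function name is made up; only documented Tables.jl calls are used, and missings are treated as zero):

```julia
using Tables

# row-wise sum over any Tables.jl-compatible source, accumulating column by column
function table_rowsum(table)
    cols = Tables.columns(table)                 # column-accessible view of the source
    colnames = Tables.columnnames(cols)
    acc = zeros(Float64, length(Tables.getcolumn(cols, colnames[1])))
    for nm in colnames
        acc .+= coalesce.(Tables.getcolumn(cols, nm), 0)
    end
    return acc
end

# works for a DataFrame, a NamedTuple of vectors, a CSV.File, ...
table_rowsum((a = [1, 2, missing], b = [4.0, 5.0, 6.0]))   # -> [5.0, 7.0, 6.0]
```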
un-typed version of |
Having the basic functions in DataFrames.jl and extra functions in DFRowOperations.jl also makes sense to me. Actually I think we are getting to the point where DataFramesMeta.jl could be accompanied by DataFramesKit.jl (and the latter would hold extra functions that users find useful but that we are hesitant to add to the core package in order to keep it lightweight). The benefit is that we could make DataFramesKit.jl 0.x and clearly indicate that its API might evolve (as opposed to DataFrames.jl). |
I think
the current design of
Working with individual columns of a data frame is one thing. I am not a fan of a separate package, since row operations are, in some sense, essential for
|
It would be interesting if you expanded on two points you have raised:
It would be really useful to know in what workflows these things are often needed and crucial. I have used DataFrames.jl every day for many years and I cannot (really) recall a situation when I needed such operations (actually this is probably why it did not get resolved for the 1.0 release: although we knew we had a limitation here, we did not feel it was pressing, as it was considered a rare use case). Thank you! |
I meant generally, not for a specific solution; for example the following cases (the questions are for each row):
|
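To make such per-row questions concrete, here are a few typical ones spelled with the existing minilanguage (my illustrations; the original list in this comment did not survive extraction):

```julia
using DataFrames

df = DataFrame(x1 = [1, missing, 3], x2 = [4, 5, missing], x3 = [7, 8, 9])

transform(df,
    # how many values are missing in this row?
    AsTable(r"x") => ByRow(r -> count(ismissing, r)) => :n_missing,
    # what is the row total, ignoring missings?
    AsTable(r"x") => ByRow(r -> sum(skipmissing(r))) => :row_sum,
    # does any value in this row exceed 5?
    AsTable(r"x") => ByRow(r -> any(v -> coalesce(v > 5, false), r)) => :has_big)
```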
@sl-solution What happened to DFRowOperations.jl? I thought it was a good package but it's a 404 now. |
@sl-solution - I support the request by @pdeffebach. We are currently at the stage of development of DataFrames.jl where we would like to go back to this performance issue, hopefully having it resolved in the 1.3 release. My current thinking is that we would intercept expressions like the ones discussed above. Thank you! |
@pdeffebach @bkamins Great!
As you might know, DataFrame is optimised for column operations, and row operations are not efficient. There are some solutions for this, and it has been discussed previously (#2440, #2757, #2439, #952, ...). Based on my knowledge, the most efficient way (with similar-to-working-with-a-matrix performance) to do this, when the problem fits into map and reduce, is using mapreduce, e.g. as sketched below.
At the beginning I thought this is very trivial and that just having some documentation about mapreduce should be enough for DataFrames.jl users. However, after thinking about it for a while, I guess this is not as trivial as I thought (particularly if missings are present). Thus, I think having a bunch of common row operations inside DataFrames.jl would be helpful, particularly operations which take care of missing automatically. Since I know this may be controversial, for the moment I am developing a package, DFRowOperation.jl, to define and store a set of common row operations. Users may use, contribute to, and evaluate this package, and if it makes sense it would be great to add its functionality into DataFrames.jl in the future.
You may access the package at https://github.com/sl-solution/DFRowOperation.jl; it currently contains one function, byrow, with support for the following optimised functionalities for row-wise operations:
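Since the mapreduce snippet itself did not survive extraction, here is a minimal sketch (my own, with made-up column names) of the kind of columnwise mapreduce row aggregation presumably meant:

```julia
using DataFrames

df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6], x3 = [7, 8, 9])

# row-wise sums computed by reducing over the columns with a broadcasted +,
# so performance is close to working directly with a matrix
row_totals = mapreduce(identity, (a, b) -> a .+ b, eachcol(df))   # -> [12, 15, 18]
```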