
Fast row aggregation in DataFrames.jl #2768

Closed
sl-solution opened this issue May 19, 2021 · 35 comments · Fixed by #2869

@sl-solution

sl-solution commented May 19, 2021

As you might know, DataFrame is optimised for column operations, and row operations are not efficient. There are some solutions for this, and it has been discussed previously (#2440, #2757, #2439, #952, ...).

To my knowledge, the most efficient way to do this (with performance similar to working with a matrix), when the problem fits into map and reduce, is using mapreduce, e.g.

df = DataFrame(rand(10^5, 100), :auto)
op(x, y) = x .+= y   # accumulate each column into the running total in place
mapreduce(identity, op, eachcol(df), init = zeros(nrow(df)))   # row-wise sum across all columns

At the beginning I thought this was very trivial and that some documentation about mapreduce should be enough for DataFrames.jl users. However, after thinking about it for a while, I believe it is not as trivial as I thought (particularly if missings are present). Thus, I think having a set of common row operations inside DataFrames.jl would be helpful, particularly operations which take care of missing automatically. Since I know this may be controversial, for the moment I have developed a package, DFRowOperation.jl, to define and store a set of common row operations. Users may use, contribute to, and evaluate this package, and if it makes sense, it would be great to add its functionality to DataFrames.jl in the future.

You may access the package at

https://github.com/sl-solution/DFRowOperation.jl

and it currently contains one function, byrow, which supports the following optimised row-wise operations:

- sum
- prod
- cumsum
- cumsum!
- cumprod
- cumprod!
- mean
- count
- any
- all
- minimum
- maximum
- var
- std
- stdze!
- stdze
- sort!
- sort
- nunique
- mapreduce
- reduce
@bkamins bkamins added this to the 1.x milestone May 19, 2021
@bkamins
Member

bkamins commented May 19, 2021

This makes a lot of sense. However, if you allow 😄 (given you have already put a lot of work into it), in DataFrames.jl I would do the following (it is easy to do, and is more extensible). I will give an example with only sum, but this extends to other functions.

When user writes:

select(df, cols => ByRow(sum) => outcol)

or (if handling missing values is required)

select(df, cols => ByRow(sum∘skipmissing) => outcol)

then we can intercept ByRow(sum) (and others). The key benefit of our current design of ByRow (and it was intentional) is that:

julia> typeof(ByRow(sum))
ByRow{typeof(sum)}

so we can dispatch on it and provide an efficient aggregation path both for DataFrame and GroupedDataFrame (similarly to how we currently do efficient aggregations for GroupedDataFrame).
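For illustration, here is a minimal sketch of this dispatch idea (row_aggregate is a hypothetical helper, not DataFrames.jl's actual internals): because the wrapped function is a type parameter, a method written for ByRow{typeof(sum)} can accumulate column by column instead of materializing rows.

using DataFrames

# hypothetical fast path selected purely by dispatch on ByRow{typeof(sum)}
row_aggregate(df::AbstractDataFrame, ::ByRow{typeof(sum)}) =
    reduce((acc, col) -> acc .+= col, eachcol(df); init = zeros(nrow(df)))

df = DataFrame(rand(10^5, 100), :auto)
row_aggregate(df, ByRow(sum))   # column-wise accumulation, no per-row tuples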

The only consideration is that if we want to be consistent we probably should first introduce AsVector as discussed in the earlier issues so one would write:

select(df, AsVector(cols) => ByRow(sum) => outcol)

as sum expects a single positional argument that is a collection.

The benefit of this approach, over your proposal, is that we do not introduce numerous new verbs. We use what we currently have and only need to add AsVector. Actually, the point is that AsVector will just work the way you want things to work, but if you write AsVector(cols) => ByRow(sum) => outcol, a special code path would be invoked so that the source vectors would not have to be materialized and an efficient reduction would be performed instead.

What do you think about it?

CC @nalimilan, @pdeffebach

@sl-solution
Author

Handling GroupedDataFrame in DFRowOperation shouldn't be difficult.

AsVector is a good idea; however, when I was thinking about this problem, the first thing for me was efficiency (compilation time, run time, and memory allocation). mapreduce uses Julia's multiple dispatch to handle heterogeneous data types, and that is the key benefit when working with DataFrame. On the other hand, at the beginning I thought AsVector was more general; however, looking at it a little more, I found that the most desirable situations for AsVector are still the kinds of problems which fit into the map & reduce paradigm.

Adding more verbs, i.e. row_xxxx, is not the perfect solution, but it is not that inconvenient either. And we can always think about a better design to incorporate them into DataFrames.jl.

@bkamins
Member

bkamins commented May 20, 2021

Could you please expand on your comment - where do you see the conflict? What I mean is to use the mapreduce approach internally, with the only twist being that we do not have to expose these row_xxxxx functions and instead hide them under a common API.

@bkamins
Member

bkamins commented May 20, 2021

I was thinking about it a bit more. Actually, your package might make sense on its own. Essentially, it could drop the dependency on DataFrames.jl and support any Tables.jl-compliant source. Then it would be more general. I feel that some people might then find it useful for their workflows that do not use DataFrames.jl.

@bkamins
Member

bkamins commented May 20, 2021

In summary I would propose the following plan:

  • I will implement a generic AsVector wrapper for DataFrames.jl, as we want it anyway (this will be a PR that is not intended to make things fast always but to specify the API)
  • in parallel we can discuss the reduction mechanisms
  • then we can decide how to incorporate them into AsVector syntax (my current thinking is that we can use special dispatch paths)

A particular comment is that, e.g.:

df = DataFrame(rand(100, 10000), :auto)
select(df, AsVector(:) => sum)
select(df, AsVector(:) => mean)

will already be fast without any special handling, as sum and mean already support fast reductions in their design in Julia Base/Statistics.

@sl-solution
Author

... What I mean is to use the mapreduce approach internally, with the only twist being that we do not have to expose these row_xxxxx functions and instead hide them under a common API.

I was thinking about this; however, it creates one issue which I don't know how to avoid:
suppose that internally, when the user calls select(df, AsVector(:) => sum), row_sum(df) is run; then the user cannot conveniently pass options to the function sum. Calling select(df, AsVector(:) => x -> sum(abs, x)) will not end up on the same path. (Any idea?)

Using row_xxxx is not that bad an idea either; it is just a convenient way to use mapreduce.

@sl-solution
Author

df = DataFrame(rand(100, 10000), :auto)
select(df, AsVector(:) => sum)
select(df, AsVector(:) => mean)

will already be fast without any special handling, as sum and mean already support fast reductions in their design in Julia Base/Statistics.

I think that would not be quite that simple (?); e.g. sum(VectorIterator(df)) needs some manual optimisation.

@bkamins
Member

bkamins commented May 20, 2021

Using row_xxxx is not that bad an idea either

Let us see what other maintainers say, but my feeling is that we will not allow for adding that many row_xxxx verbs to DataFrames.jl (they can live in a separate package though).

AsVector(:) => x ->sum(abs, x)

Yes - this will not be able to use the fast path, and this would be a limitation of this approach. It is the same limitation we have in fast aggregations.

However, if in the future Julia Base had nicer support for currying, it would be very easy to add:

AsVector(:) => Base.Fix1(sum, abs)

it is just that right now Base.Fix1(sum, abs) is very ugly and uses an unexported feature, so I would not add support for it for now, as it would be very unlikely to be used in practice.

e.g. sum(VectorIterator(df)) needs some manual optimisation.

Yes, it would not be optimal, but it would already be relatively good. As commented - we could always add a special path to it.

@bkamins
Member

bkamins commented May 20, 2021

Maybe let me add another general comment. In DataFrames.jl we try to have a different design approach than e.g. Pandas has. We want to keep the body of the package as minimal as possible, while making sure that the core is flexible enough to cover users' needs. The assumption is that satellite packages (or even some simple Julia Base code) can cover the rest. The idea is that we do not need to bake all the functionality into DataFrames.jl for the sake of speed (as opposed to e.g. Pandas, where you have to call optimized C code to be fast, so everything has to be included). This is a consideration I have in mind when discussing this issue.

This is a different situation than e.g. your #2743 proposal. In #2743 we know that the stack/unstack combo we now have is a kind of legacy design that is not optimal, so some methods with a better API for reshaping data frames need to be added to the core of the package at some point. There the issue is that designing such an API is super hard and requires consideration. The key point is that it should be immediately clear to the user how the API should be applied (this is the bane of the current stack/unstack - not only are they not flexible enough, but also no one ever remembers the order of their arguments and the docstrings need to be constantly looked up).

In summary, both this proposal and #2743 hit important things we want to focus on adding after the 1.0 release. However, both issues - for different reasons - are not simple decisions where we can quickly merge new functionality.

Let me also give you the following perspective: the proposals by @pstorozenko (a new contributor) in #2726 and #2727 were easy decisions to turn into PRs and merge, as they were both clear improvements that do not affect the user-facing API. Everything that affects the exposed API will take much longer to process. We are after the 1.0 release (which we had to make although we knew there were things still to work on). But being after the 1.0 release means that we cannot loosely experiment with the user-facing API any more. Whatever we add must be very well thought out, so that we are 100% sure it will not change in the future.

@sl-solution
Author

Discussing this issue will eventually be helpful, and we can end up with a solid solution.

Let us see what other maintainers say, but my feeling is that we will not allow for adding that many row_xxxx verbs to DataFrames.jl .

I guess this wouldn't be an issue, since a suitable design can solve it (?) (although we are used to this kind of verb in Julia Base 😄, since sum, mean, ... are essentially wrappers around map & reduce).

e.g. sum(VectorIterator(df)) needs some manual optimisation.

Yes, it would not be optimal, but it would already be relatively good. As commented - we could always add a special path to it.

I think VectorIterator is a nice approach; however, it wouldn't scale well (unless adjusted case by case, which means introducing many paths internally).

@nalimilan
Member

I fully agree with @bkamins. We should keep the API both minimal and flexible. The strength of DataFrames.jl is that it leverages the fact that Julia allows writing custom code that is fast, so that we don't need to provide dozens of special methods. When particular optimizations are needed, they should happen under the hood, while users still use the more general syntax.

@sl-solution
Author

sl-solution commented May 21, 2021

I explored the idea a little more and defined a function called byrow(f, df, cols; [by = identity, ....]), which helps to avoid many verbs, e.g.

julia> df = DataFrame(g = [1, 1, 1, 2, 2],
               x1_int = [0, 0, 1, missing, 2],
               x2_int = [3, 2, 1, 3, -2],
               x1_float = [1.2, missing, -1.0, 2.3, 10],
               x2_float = [missing, missing, 3.0, missing, missing],
           x3_float = [missing, missing, -1.4, 3.0, -100.0])
5×6 DataFrame
 Row │ g      x1_int   x2_int  x1_float   x2_float   x3_float
     │ Int64  Int64?   Int64   Float64?   Float64?   Float64?
─────┼─────────────────────────────────────────────────────────
   1 │     1        0       3        1.2  missing    missing
   2 │     1        0       2  missing    missing    missing
   3 │     1        1       1       -1.0        3.0       -1.4
   4 │     2  missing       3        2.3  missing          3.0
   5 │     2        2      -2       10.0  missing       -100.0


julia> byrow(sum, df, r"x")
5-element Vector{Union{Missing, Float64}}:
   4.2
   2.0
   2.6
   8.3
 -90.0

julia> byrow(sum, df, r"x", by = abs)
5-element Vector{Union{Missing, Float64}}:
   4.2
   2.0
   7.4
   8.3
 114.0

This can fit into the current design with a small change. For example, in select, something like cols => byrow(f; kwargs...) would be translated to byrow(f, df, cols; kwargs...), or to something like f.(eachrow(df[!, cols])) when f doesn't fit into map & reduce. (The by argument follows the sort convention in Julia.)

julia> select(df, :, r"x" => byrow(sum, by = abs) => :total)

5×7 DataFrame
 Row │ g      x1_int   x2_int  x1_float   x2_float   x3_float   total
     │ Int64  Int64?   Int64   Float64?   Float64?   Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────────
   1 │     1        0       3        1.2  missing    missing         4.2
   2 │     1        0       2  missing    missing    missing         2.0
   3 │     1        1       1       -1.0        3.0       -1.4       7.4
   4 │     2  missing       3        2.3  missing          3.0       8.3
   5 │     2        2      -2       10.0  missing       -100.0     114.0
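For illustration, a minimal sketch of how such a byrow-style reduction with a by argument could look under the hood (byrow_sum_sketch is a hypothetical helper, not the DFRowOperation.jl implementation; it assumes a Float64 accumulator and treats missing as zero):

using DataFrames

function byrow_sum_sketch(df::AbstractDataFrame, cols; by = identity)
    acc = zeros(nrow(df))
    for col in eachcol(df[!, cols])
        acc .+= coalesce.(by.(col), 0.0)   # apply `by` element-wise, then accumulate
    end
    return acc
end

byrow_sum_sketch(df, r"x", by = abs)   # should reproduce the byrow(sum, df, r"x", by = abs) result above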

@bkamins
Member

bkamins commented May 21, 2021

Rewriting it a bit, we could - instead of adding AsVector - add something like RowReduce and RowMapReduce that would only be allowed when the column selector is not AsTable. Then the syntax could be:

cols => RowReduce(op; init, skipmissing=false)

and

cols => RowMapReduce(f, op; init, skipmissing=false)

which would efficiently call reduce and mapreduce on the collection given by row selector, optionally skipping missing values.

This would be a bit less flexible than AsVector, but maybe indeed supporting only reduce and mapreduce is enough in practice.
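Purely to make the proposal concrete, here is a hypothetical sketch of what such wrappers could look like (this API does not exist in DataFrames.jl; names and fields are illustrative only):

struct RowMapReduce{F,O,I}
    f::F
    op::O
    init::I
    skipmissing::Bool
end
RowMapReduce(f, op; init, skipmissing = false) = RowMapReduce(f, op, init, skipmissing)
RowReduce(op; init, skipmissing = false) = RowMapReduce(identity, op, init, skipmissing)

# a row-wise sum over selected columns could then be spelled as
# cols => RowReduce(+; init = 0.0, skipmissing = true) => :row_sum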

So we have three designs on the table:

  • originally planned AsVector
  • your proposal with byrow
  • the approach with RowReduce and RowMapReduce

Let us wait for other users/contributors to comment on what they think.

@sl-solution
Author

Rewriting it a bit, we could - instead of adding AsVector - add something like RowReduce and RowMapReduce that would only be allowed when the column selector is not AsTable. Then the syntax could be:

cols => RowReduce(op; init, skipmissing=false)

and

cols => RowMapReduce(f, op; init, skipmissing=false)

I updated byrow to include mapreduce and reduce.
I think RowReduce and RowMapReduce are not enough, since defining op and init is not trivial; e.g. consider the following cases (see the sketch after this list):

  • op for sum is simple, but using mapreduce directly will be verbose, particularly when missings are present
  • op for cumsum is simple but not very obvious (op(x, y) = y .+= x), and, as with sum, it will be verbose
  • init for maximum and minimum can easily go wrong
  • op and init for any and all are not trivial
  • and for something like nunique, which belongs to the same class of problems, there is no simple yet efficient way.
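An illustrative sketch of the kind of op/init boilerplate this requires (plain mapreduce over eachcol, not any package's API):

using DataFrames

df = DataFrame(a = [1, missing, 3], b = [2, 5, missing])

# row-wise maximum: a zero init would be silently wrong for all-negative rows,
# and missings have to be handled inside op
rowmax = mapreduce(identity,
                   (acc, col) -> max.(acc, coalesce.(col, typemin(Int))),
                   eachcol(df);
                   init = fill(typemin(Int), nrow(df)))

# row-wise any(> 2): op must combine Bools and init must be all-false
rowany = mapreduce(col -> coalesce.(col .> 2, false),
                   (acc, x) -> acc .| x,
                   eachcol(df);
                   init = falses(nrow(df)))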

@pdeffebach
Contributor

pdeffebach commented May 22, 2021

I've been following this conversation and I have the following thoughts.

The main problem is that very large NamedTuples lead to compilation-related slowdowns. If NamedTuples of Vectors didn't have this problem, then we could just use reduce directly, as needed, on a NamedTuple of Vectors.

In other words, the problem is not that we don't have a good way to reduce; the problem is that we don't have a good way to deal with lots of columns. So the way to fix that is to have an un-typed (or less typed) collection of columns to work with. We should therefore introduce AsVector, which has the same logic as AsTable:

  • When not ByRow, it passes a vector of vectors
  • When ByRow, each row acts as a vector. This probably won't be very performant, but it will be useful.

I think RowReduce is a bit of a red herring, because if a user can follow the logic of RowReduce, they can just implement reduce themselves. If naively calling reduce is slow, it's on us to make sure the problem is fixed "upstream", i.e. by making AsVector as fast as possible.

As for DataFramesMeta, we still need to implement AsTable, but adding AsVector would be trivial. On the other hand, implementing some kind of RowReduce would be very hard and cumbersome. Just look at all the thought that had to go into @byrow for wrapping ByRow.

@bkamins
Member

bkamins commented May 22, 2021

i.e. making AsVector as fast as possible.

If we go for this option, the implementation I proposed some time ago would make AsVector relatively fast, as I would allocate the passed vector only once and then reuse it.


In summary the tension is:

  • the AsVector approach; it will be consistent with the whole design and "fast enough", but not "fastest possible"
  • the approach proposed by @sl-solution, which will be faster at the expense of having a different design

What we could do is define AsVector in DataFrames.jl as previously planned and have https://github.com/sl-solution/DFRowOperation.jl (or whatever name we end up with) as a satellite package that would add extra features.
A benefit of this approach is e.g. that @sl-solution prefers to skip missing by default (and I understand this desire), whereas Julia Base does not do this.

However, let me think a bit more about it. For now I think that AsVector(cols) => ByRow(sum) could be made very efficient by handling this syntax specially. The limitation is the lack of the by option that @sl-solution proposes, but for this I have opened https://discourse.julialang.org/t/improve-currying-in-julia-base/61652 to investigate what could be possible in Julia Base.

@sl-solution
Author

sl-solution commented May 22, 2021

Just one thing to consider: AsVector and AsTable will be inefficient both for many columns with few rows and for many rows with few columns, e.g. (I used sum for simplicity - and note that I haven't used skipmissing for AsVector and AsTable):

df = DataFrame(randn(10^7, 10), :auto)
myfun(x) = x[1] < 0 ? round.(Int, x) : x   # make some columns Int and others Float64
mapcols!(myfun, df)
allowmissing!(df, r"1")                    # allow missing in columns whose name contains "1"

df[1,1]=missing
@time byrow(sum, df)
 0.204496 seconds (263 allocations: 85.845 MiB)

fsum(df) = [sum(x) for x in VectorIterator(df)]

@time fsum(df)
6.876729 seconds (259.99 M allocations: 3.958 GiB, 8.20% gc time)

@time combine(df , AsTable(:)=>ByRow(sum))
9.570441 seconds (250.00 M allocations: 11.856 GiB, 20.97% gc time)

@bkamins
Member

bkamins commented May 22, 2021

The implementation of VectorIterator in #2440 was preliminary and mostly intended for homogeneous columns (and then it is fast, as I assume you have checked). However, indeed, maybe it is not possible to make it fast for heterogeneous columns. The question is: how important is the case of heterogeneous columns in practice?

Secondly - byrow will also be inefficient if you use it on a general function that it is not specialized for. AsVector(cols) => sum can easily dispatch to special code (exactly like your byrow) and be efficient, as it would not use VectorIterator at all (which would only be used in the generic case). The idea of VectorIterator was to have fast code in the case of columns having a constant type, as then we can be fast cheaply, and eachcol(df) is not going to give us that.

@pdeffebach
Contributor

pdeffebach commented May 22, 2021

I agree. The code for byrow is very impressive, and we can absolutely intercept AsVector(cols) => sum or AsVector(cols) => ByRow(sum) and use an optimized implementation as you did.

Skipping missings might make this story more complicated, but as mentioned by Bogumil, we really can't deviate from Base Julia on this. I definitely hope we can have fast and easy syntax for this, i.e. I guess we can also intercept AsVector(cols) => ByRow(sum ∘ skipmissing).

@bkamins
Member

bkamins commented May 22, 2021

AsVector(cols) => ByRow(sum ∘ skipmissing) can be easily intercepted.

@pdeffebach
Contributor

pdeffebach commented May 22, 2021

I think one other note, @sl-solution, is that if we go with AsVector, then it would be great if your DFRowOperations became oriented around dealing with Vectors of Vectors rather than DataFrames.

It seems unlikely DataFrames would implement all of your really finely-tuned functions, but if it were super easy to use them inside a transform block, i.e. by accepting a Vector of Vectors, then the user would be able to use all of your functions without learning a new API that accepts DataFrames as a whole.

Something like

transform(df, AsVector(:) => sl_solution_sum_skipping_missings)

@sl-solution
Author

(and then it is fast as I assume you have checked).

😃 I checked indeed, but it has about the same timing as what I reported in the last comment. The by argument is also critical, because AsVector will be about 3 times slower if a function needs to be applied first.

but if it were super easy to use them inside a transform block, i.e. by accepting a Vector of Vectors, then the user would be able to use all of your functions without learning a new API that accepts DataFrames as a whole...

missing can be discussed and it is not my main point (although I think the approach in Base is wrong).
I guess a better design would be having byrow (in some acceptable form) and intercepting it via AsVector internally.

One question: is AsTable only for fast row operations?

@pdeffebach
Contributor

One question: is AsTable only for fast row operations?

AsTable is useful beyond its speed. A NamedTuple of Vectors is kind of the most basic concept of a table as defined by Tables.jl, so it's an easy way to interact with any function that accepts a table source, and it's typed where DataFrames are not, which can lead to faster code.

As for row operations: yeah, similarly, a NamedTuple of scalars is the most basic "row" that interacts with Tables.jl. So AsTable => ByRow(f) is great for composability with the rest of the ecosystem.

I guess a better design would be having byrow (in some acceptable form) and intercept it with AsVector internally.

I just want to clarify one point: AsVector, while being a Julia object, is not what gets sent to functions. What gets sent to functions would be a Vector of Vectors. AsVector would be internal to DataFrames only (like AsTable is) and is just part of the "mini-language".

AsTable is to NamedTuple of Vectors as AsVector is to a Vector of Vectors. Functions would only know about Vectors of Vectors, not AsVector.
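A small illustration of the point above, using standard DataFrames.jl syntax: AsTable(cols) => ByRow(f) passes each row to f as a NamedTuple.

using DataFrames

df = DataFrame(a = 1:3, b = 4:6)
transform(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :a_plus_b)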

@bkamins
Member

bkamins commented May 22, 2021

One question: is AsTable only for fast row operations?

Just to add to what @pdeffebach commented: the API design in DataFrames.jl is driven by the following considerations:

  1. composability
  2. avoiding embarrassingly slow execution
  3. if possible, providing very fast execution

The point is that DataFrames.jl was never intended to be a super fast package - its main point was convenience. If someone wants super fast operations, then other data structures are probably better suited than DataFrame, which is inherently slow by design because of its type instability.

Having said that, it does not mean that we do not want good speed - we do. It is just that the mental model of development is that we first need to ensure that what we provide is consistent with the whole JuliaData ecosystem, and only then think about how to make it fast.

With row aggregations we are currently in a state of "embarrassingly slow execution" with the API we have now, so clearly this should be fixed, but we need to do it in a way that is consistent with the ecosystem and easy to maintain in the long run. That is why we are hesitant to make a decision.

@sl-solution
Author

I just want to clarify one point: AsVector, while being a Julia object, is not what gets sent to functions. What gets sent to functions would be a Vector of Vectors.

But, thankfully, it is not a Vector of Vectors 😃

One thing to consider is that byrow not only makes things faster and more efficient, it also makes the design leaner.

@sl-solution
Author

...That is why we are hesitant to make a decision.

I totally understand it and it makes sense.

The current implementation of byrow (I know it has not been thoroughly checked or optimised) is only (maybe 3-5 times) better than other packages, like pandas row operations (and something like polars will be even better in the future), so we will be embarrassingly slow if we don't think hard.

@pdeffebach
Contributor

pdeffebach commented May 23, 2021

But, thankfully, it is not a Vector of Vectors

Sorry, just to clarify: I mean a Vector of Vectors as in eachcol(df), not VectorIterator (rows as vectors), which we know will always be pretty slow.

One thing to consider is that byrow not only makes things faster and more efficient, it also makes the design leaner.

I agree that the API of byrow is nice, but the transform(df, src => fun => dest) syntax is really flexible and people have had a certain amount of success working with it. Combined with DataFramesMeta, I think it's a pretty robust system that can handle a lot of different operations.

I see the following way forward, and I'm wondering if you agree:

  1. Introduce AsVector (the default being a vector of columns, with ByRow a vector of rows).
  2. Port some functions currently in DFRowOperations, and intercept common patterns like ByRow(sum), ByRow(mean) etc. This will be just as fast as anything in DFRowOperations.
  3. For the remaining operations, point users to DFRowOperations. Also, make DFRowOperations DataFrames-independent, i.e. have it assume only that objects follow the Tables API. That way it can be used with any table, not just data frames.
  4. Optionally overload functions in DFRowOperations to work with vectors of columns, i.e.
julia> using Tables

julia> function byrow_sum(t)
           v_of_vs = [Tables.getcolumn(t, s) for s in Tables.columnnames(t)]
           byrow_sum(v_of_vs)
       end;

julia> function byrow_sum(t::Vector{Vector{T}}) where T <: Real
           reduce(+, t)
       end;

julia> t = (a = rand(100), b = rand(100));

julia> byrow_sum(t);

(I'm sure there are other Tables.jl functions that would make this more composable). But despite allocating a vector in the outer function, the timing of this function is comparable to byrow(sum, df) when I optimize a little bit by copying DFRowOperation.

I say "optionally" because we have a syntax for transform(df, fun) (no source or dest) for which fun acts on a DataFrame. So any function that accepts a tables-compatible source can also take advantage of this syntax and it would be easy to have DFRowOperation be a "standalone" API that also interacts reasonably well with transform. But having it accept a a vector of columns allow us to take advantage of the src => fun => dest syntax even more.

EDIT: There are some complicated concerns here about typed vs untyped tables and when a Vector of Vectors is strictly superior to a NamedTuple of Vectors that could probably be handled via the Tables API. Given that we have the Tables API I'm actually hesitant to suggest package writers also support vectors of vectors... maybe the solution is just an un-typed version of AsTable...

@bkamins
Member

bkamins commented May 23, 2021

An un-typed version of AsTable would just be passing a data frame 😄. It was also discussed as an alternative/addition to AsVector. The question is which one would be preferable. This is mainly important for the "slow path". For the fast path, AsVector or AsDF is indistinguishable (as we would write custom code handling this efficiently anyway). Intuitively, AsVector makes more sense, as if you write AsVector(cols) => sum it is clear that you are consistent with the sum API.

@bkamins
Member

bkamins commented May 23, 2021

Having the basic functions in DataFrames.jl and extra functions in DFRowOperations.jl also makes sense to me. Actually, I think we are getting to the point where DataFramesMeta.jl could be accompanied by DataFramesKit.jl (the second would hold extra functions that users find useful, but which we are hesitant to add to the core of the package to keep it lightweight). The benefit is that we could make DataFramesKit.jl 0.x and clearly indicate that the API might evolve (as opposed to DataFrames.jl).

@sl-solution
Author

I think byrow can fit into the transform, select, ... design.

  1. Port some functions currently in DFRowOperations, and intercept common patterns like ByRow(sum), ByRow(mean) etc. This will be just as fast as anything in DFRowOperations.

The current design of ByRow doesn't allow passing options.

  1. Optionally over-load functions in DFRowOperations to work with Vectors of columns, i.e.

mapreduce, the core of byrow, already works with vectors of columns, and that's not the problem.

Working with individual columns of a DataFrame doesn't need type stability (the Julia dispatcher takes care of it), and byrow uses Julia itself to solve the type-stability problem when there are multiple heterogeneous (in some sense) columns. The point is that DataFrames can solve most of its problems without using NamedTuple.

I am not a fan of a separate package, since row operations are, in some sense, essential for DataFrames.jl, and AsVector and AsTable are not efficient enough. Two things should be kept in mind; in real life:

  • data contains heterogeneous columns (AsVector wouldn't solve this easily)
  • passing options to the row operator is crucial (so ByRow and AsTable are very restricted)

@bkamins
Member

bkamins commented May 23, 2021

It would be interesting if you expanded on two points you have raised:

since row operations are, in some sense, essential for DataFrames.jl

passing options to the row operator is crucial

It would be really useful to know in what workflows these things are often needed and crucial. I have used DataFrames.jl every day for many years and I really cannot recall a situation when I needed such operations (this is probably the reason why it did not get resolved for the 1.0 release: although we knew we had a limitation here, we did not feel it was pressing, as it was considered a rare use case).

Thank you!

@sl-solution
Author

It would be interesting if you expanded on two points you have raised: .....

I meant generally, not for a specific solution; for example, the following cases (the questions are per row; see the sketch after this list):

  • sum of rows
  • mean of rows
  • rows which don't have missing values
  • is there any sale on Sunday in each row
  • how many positive numbers are in each row
  • which month (or day of week) is the most frequent for a patient
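For reference, an illustrative sketch of how some of these could be written with the existing AsTable/ByRow syntax (the slow generic path today; the interception discussed above would make such patterns fast):

using DataFrames, Statistics

df = DataFrame(x1 = [1, missing, 3], x2 = [4, 5, 6])

transform(df,
    AsTable(r"x") => ByRow(sum ∘ skipmissing) => :row_sum,
    AsTable(r"x") => ByRow(mean ∘ skipmissing) => :row_mean,
    AsTable(r"x") => ByRow(r -> !any(ismissing, r)) => :no_missing,
    AsTable(r"x") => ByRow(r -> count(>(0), skipmissing(r))) => :n_positive)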

@pdeffebach
Contributor

@sl-solution What happened to DFRowOperations.jl? I thought it was a good package but it's a 404 now.

@bkamins bkamins modified the milestones: 1.x, 1.3 Sep 3, 2021
@bkamins
Member

bkamins commented Sep 3, 2021

@sl-solution - I support the request by @pdeffebach. We are currently at the stage of development of DataFrames.jl where we would like to go back to this performance issue, hopefully landing it in the 1.3 release.

My current thinking is that we would intercept expressions like AsTable(cols) => ByRow(mean) or AsTable(cols) => ByRow(mean∘skipmissing) => :out_col and use a specialized fast path for processing them for AbstractDataFrame objects (exactly like we now do a similar thing when someone writes :col => sum => :out_col in GroupedDataFrame aggregation). But it would be great to consult on the implementation with you and against https://github.com/sl-solution/DFRowOperation.jl to make sure we have efficient code.

Thank you!

@bkamins bkamins linked a pull request Sep 9, 2021 that will close this issue
@sl-solution
Author

@pdeffebach @bkamins Great!
I am currently very busy; if you like the idea, use it.
