more careful test of ByRow for PooledArray #2837

bkamins · 2021-08-10T13:29:17Z

I think the fast path for map should be asked for explicitly. Otherwise users will not understand the following result:

julia> let
           id = 0
           df = DataFrame(a=PooledArray([1, 1, 1]))
           function f(x)
               id += 1
               return id
           end
           select(df, :a => ByRow(f) => :a)
       end
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     2
   2 │     2
   3 │     2
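
For readers following along, here is a minimal sketch of why a pooled fast path and a stateful function interact badly. It assumes a toy pooled vector: `ToyPooled`, `map_pool`, `map_rows` and `make_counter` are hypothetical names invented for illustration, not the PooledArrays.jl or DataFrames.jl implementation. It does not try to reproduce the exact value `2` shown above, only the fact that every row receives the same value.

```julia
# Hypothetical toy type (an assumption for illustration, not PooledArrays.jl):
# `pool` stores the unique values, `refs` stores one index into `pool` per row.
struct ToyPooled{T}
    pool::Vector{T}
    refs::Vector{Int}
end

# Fast path: call `f` once per pool entry, then expand the results through `refs`.
map_pool(f, x::ToyPooled) = map(f, x.pool)[x.refs]

# Row-wise path: call `f` once per row.
map_rows(f, x::ToyPooled) = [f(x.pool[r]) for r in x.refs]

# A stateful (impure) function, like `f` in the example above:
# each call increments a counter and returns it, ignoring its argument.
function make_counter()
    id = 0
    return function (x)
        id += 1
        return id
    end
end

x = ToyPooled([1], [1, 1, 1])        # same data as PooledArray([1, 1, 1])
fast = map_pool(make_counter(), x)   # the counter is called once
slow = map_rows(make_counter(), x)   # the counter is called three times

println(fast)
println(slow)
```

With a pure function both paths agree; they differ only when `f` has side effects, which is the case a user of `ByRow` would likely find surprising.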

pdeffebach · 2021-08-11T14:44:43Z

In the future, in order to support easier out-of-memory operations, we should maybe give this to DataAPI and see if Dagger-esque arrays want to overload the relevant functions.

bkamins · 2021-08-11T15:42:46Z

Yes - I think that out-of-core functionality in DataFrames.jl should be one of the priorities to investigate after the 1.3 release (where I want to finish polishing the API).

Hopefully MIT JuliaLab can join the design here - I have had some preliminary discussions with @ViralBShah about it.
Also @quinnj has this issue on the radar.

(I am mentioning all of them because adding out-of-core support to DataFrames.jl was not the original goal of the package, and it will take significant effort to do it correctly and efficiently).

ViralBShah · 2021-08-11T18:21:54Z

@shashi spent quite a bit of time on out-of-core with IndexedTables.jl on JuliaDB, and @joshday integrated OnlineStats.jl with it. All that code is in the JuliaDB repo for reviewing from a design perspective. It was built on Dagger, which has since made substantial progress thanks to @jpsamaroo.

Would GPU readiness be an easier lift? Do you have thoughts on whether out of core is better to focus on than distributed in-memory? We thought Dagger could be an answer to both, but the system became quite complex. Of course everything in Julia has become a lot better today.

bkamins · 2021-08-12T06:27:50Z

I have opened a poll at https://discourse.julialang.org/t/future-directions-for-dataframes-jl/66247 for this. Can you all please vote? Thank you!

My personal perspective is that we should focus on out-of-core processing. The reason is the following. A user with data to process may hit two bottlenecks:

1. the data does not fit in RAM (and thus DataFrames.jl is hard to use) -> solved by out-of-core processing capabilities
2. processing is too slow (and thus DataFrames.jl is inconvenient to use) -> solved by GPU processing capabilities

Given these two, I would prefer to concentrate on the harder constraint. Slow computation is not such a big problem, as we are already quite fast for normal processing tasks (all operations take roughly "seconds", so even if we can be faster this will not be that noticeable, especially as GPU support will probably significantly increase compilation latency).

nalimilan · 2021-08-12T14:43:07Z

This seems like the wrong place to fix #2834 to me. If we think the current behavior of map for PooledArray is misleading, we should change it in PooledArrays rather than work around it in DataFrames. Let's continue the discussion at JuliaData/PooledArrays.jl#63.

bkamins · 2021-09-01T07:45:03Z

@nalimilan - this should be good to merge. CI failed because of an out-of-memory error on the server side, which happens sporadically.

bkamins · 2021-09-06T21:17:44Z

Thank you!

fix ByRow for PooledArray

cc926d9

bkamins added the bug label Aug 10, 2021

bkamins added this to the patch milestone Aug 10, 2021

bkamins requested a review from nalimilan August 10, 2021 13:29

fix tests

c7ba0a8

bkamins mentioned this pull request Aug 12, 2021

Purity assumption of map JuliaData/PooledArrays.jl#63

Closed

revert map change in ByRow

eab2431

bkamins changed the title ~~fix ByRow for PooledArray~~ more careful test of ByRow for PooledArray Sep 1, 2021

bkamins closed this Sep 1, 2021

bkamins reopened this Sep 1, 2021

nalimilan approved these changes Sep 3, 2021

Update test/select.jl

a86daf1

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit a41f470 into main Sep 6, 2021

bkamins deleted the bk/fix_byrow branch September 6, 2021 21:17
