Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more careful test of ByRow for PooledArray #2837

Merged
merged 4 commits into from
Sep 6, 2021
Merged

more careful test of ByRow for PooledArray #2837

merged 4 commits into from
Sep 6, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Aug 10, 2021

Fixes #2834

I think the fast path for map should be asked for explicitly. Otherwise users will not understand the following result:

julia> let
           id = 0
           df = DataFrame(a=PooledArray([1, 1, 1]))
           function f(x)
               id += 1
               return id
           end
           select(df, :a => ByRow(f) => :a)
       end
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     2
   2 │     2
   3 │     2

@bkamins bkamins added the bug label Aug 10, 2021
@bkamins bkamins added this to the patch milestone Aug 10, 2021
@bkamins bkamins requested a review from nalimilan August 10, 2021 13:29
@pdeffebach
Copy link
Contributor

In the future, in order to support easier out-of-memory operations, we should maybe give this to DataAPI and see if Dagger-esque arrays want to overload the relevant functions.

@bkamins
Copy link
Member Author

bkamins commented Aug 11, 2021

Yes - I think that out-of-core functionality in DataFrames.jl should be one of the priorities to investigate after 1.3 release (where I want to finish polishing the API).

Hopefully MIT JuliaLab can join the design here - I have had some preliminary discussions with @ViralBShah about it.
Also @quinnj has this issue on the radar.

(I am mentioning all them as adding out-of-core support for DataFrames.jl was not the original goal of the package and it will be a significant effort to do it correctly and efficiently).

@ViralBShah
Copy link
Contributor

ViralBShah commented Aug 11, 2021

@shashi spent quite a bit of time on out-of-core with IndexedTables.jl on JuliaDB, and @joshday integrated OnlineStats.jl with it. All that code is in the JuliaDB repo for reviewing from a design perspective. It was built on Dagger, which has since made substantial progress thanks to @jpsamaroo.

Would GPU readiness be an easier lift? Do you have thoughts on whether out of core is better to focus on than distributed in-memory? We thought Dagger could be an answer to both, but the system became quite complex. Of course everything in Julia has become a lot better today.

@bkamins
Copy link
Member Author

bkamins commented Aug 12, 2021

I have opened a pool https://discourse.julialang.org/t/future-directions-for-dataframes-jl/66247 for this. Can you all please vote? Thank you!

My personal perspective is that we should focus on out-of-core processing. The reason is the following. If the user has some data to process one may hit two bottlenecks:

  • data does not fit RAM (and thus DataFrames.jl is hard to use) -> solved by out-of-core processing capabilities
  • processing is too slow (and thus DataFrames.jl is inconvenient to use) -> solved by GPU processing capabilities

Given these two I would prefer to concentrate on harder constraint. The fact that computing is too slow is not such a big problem as we currently are already quite fast for normal processing tasks (all operations take roughly "seconds", so even if we can be faster this will not be that noticeable, especially as GPU support will probably significantly increase compilation latency).

@nalimilan
Copy link
Member

This seems like the wrong place to fix #2834 to me. If we think the current behavior of map for PooledArray is misleading, we should change it in PooledArrays rather than work around it in DataFrames. Let's continue the discussion at JuliaData/PooledArrays.jl#63.

@bkamins bkamins changed the title fix ByRow for PooledArray more careful test of ByRow for PooledArray Sep 1, 2021
@bkamins bkamins closed this Sep 1, 2021
@bkamins bkamins reopened this Sep 1, 2021
@bkamins
Copy link
Member Author

bkamins commented Sep 1, 2021

@nalimilan - this should be good to merge. CI failed because of out of memory on server side that sporadically happens.

test/select.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit a41f470 into main Sep 6, 2021
@bkamins bkamins deleted the bk/fix_byrow branch September 6, 2021 21:17
@bkamins
Copy link
Member Author

bkamins commented Sep 6, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should ByRow use map or not
4 participants