more careful test of ByRow for PooledArray #2837
In the future, in order to support easier out-of-memory operations, we should maybe give this to DataAPI and see if Dagger-esque arrays want to overload the relevant functions.
Yes - I think that out-of-core functionality in DataFrames.jl should be one of the priorities to investigate after the 1.3 release (where I want to finish polishing the API). Hopefully MIT JuliaLab can join the design here - I have had some preliminary discussions with @ViralBShah about it. (I am mentioning all of them because adding out-of-core support was not the original goal of DataFrames.jl, and it will take significant effort to do it correctly and efficiently.)
@shashi spent quite a bit of time on out-of-core with IndexedTables.jl on JuliaDB, and @joshday integrated OnlineStats.jl with it. All that code is in the JuliaDB repo for reviewing from a design perspective. It was built on Dagger, which has since made substantial progress thanks to @jpsamaroo. Would GPU readiness be an easier lift? Do you have thoughts on whether out-of-core is better to focus on than distributed in-memory? We thought Dagger could be an answer to both, but the system became quite complex. Of course, everything in Julia has become a lot better today.
I have opened a poll https://discourse.julialang.org/t/future-directions-for-dataframes-jl/66247 for this. Can you all please vote? Thank you! My personal perspective is that we should focus on out-of-core processing. The reason is the following. A user who has some data to process may hit two bottlenecks:
Given these two, I would prefer to concentrate on the harder constraint. Computing being too slow is not such a big problem, as we are already quite fast for normal processing tasks (all operations take roughly "seconds", so even if we can be faster it will not be that noticeable, especially as GPU support will probably significantly increase compilation latency).
This seems like the wrong place to fix #2834 to me. If we think the current behavior of
@nalimilan - this should be good to merge. CI failed because of an out-of-memory error on the server side, which happens sporadically.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you!
Fixes #2834
I think the fast path for `map` should be asked for explicitly. Otherwise users will not understand the following result: