Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not pool values by default? #822

Closed
koalive opened this issue Apr 8, 2021 · 8 comments
Closed

Do not pool values by default? #822

koalive opened this issue Apr 8, 2021 · 8 comments

Comments

@koalive
Copy link

koalive commented Apr 8, 2021

Hi,

From the documentation it seems values are pooled (using PooledArrays) by default when there are many repeated values. However the behavior of pooled arrays is very different from standard arrays (see example below) and this leads to complications that are hard to predict as the resulting data type depends on the input structure.
Switching strings to categorical data by default was previously discarded for similar reasons, so would it make sense to set pool=0 by default? If not, maybe a warning when the data is pooled could help users to track the issue if they run into it?

Cheers.

using CSV, DataFrames

X = DataFrame(x = rand(["A", "B", "C"], 200));

CSV.write("test.csv", X)

Y = CSV.read("test.csv", DataFrame)

typeof(X.x)
# Array{String,1}

typeof(Y.x)
# PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}

map(println, X.x);
# B
# A
# C
# B
# A
# ...

map(println, Y.x);
# B
# A
# C

This was done with Julia 1.5, CSV v0.8.4, DataFrames v0.22.5 and PooledArrays v1.2.1.

@quinnj
Copy link
Member

quinnj commented Apr 13, 2021

Pooling values by default can have a lot of advantages when doing grouping/joining operations in DataFrames, which is why it is turned on by default. The specific issue you're seeing is that map on a PooledArray assumes the input function is pure, meaning no side effects, and thus is called only on the underlying pool itself instead of the entire array. There's an open issue currently discussing whether that behavior should be kept or at least better documented: JuliaData/PooledArrays.jl#63.

cc: @bkamins @oxinabox

@quinnj quinnj closed this as completed Apr 13, 2021
@bkamins
Copy link
Member

bkamins commented Apr 13, 2021

I think using PooledArrays.jl by default makes sense. If you have large files then it saves a lot of memory in practice and speeds up operations. You can always disable this.

The crucial issue is how map should work. As you can see in the linked threads it was my concern. Maybe having map and pure_map differentiated would be the best approach here?

@bkamins
Copy link
Member

bkamins commented Apr 13, 2021

In DataFrames.jl one crucial line where we use map that is impacted would be:

https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/selection.jl#L171

which now uses the map feature we discuss here. Here is an example:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=repeat(1:10, 10^6));

julia> df.y = PooledArray(df.x);

julia> combine(df, :x => ByRow(abs));

julia> @time combine(df, :x => ByRow(abs));
  0.030409 seconds (94 allocations: 76.299 MiB, 15.49% gc time)

julia> combine(df, :y => ByRow(abs));

julia> @time combine(df, :y => ByRow(abs));
  0.012936 seconds (115 allocations: 38.154 MiB)

@quinnj
Copy link
Member

quinnj commented Apr 13, 2021

I do kind of like the idea of having a DataAPI.puremap(f, x) = map(f, x) that we could overload in PooledArrays.jl to do the faster thing (and use in DataFrames.jl).

@nalimilan
Copy link
Member

AFAIK no real-world example where map purity is a problem has been reported yet so I'm not sure it's a problem in practice.

@bkamins
Copy link
Member

bkamins commented Apr 15, 2021

I sometimes println for debugging :).

@hdavid16
Copy link

A warning somewhere would be nice. I've spent several hours trying to debug bizarre behavior on DataFrame operations and then realize it is because I am using map on a DataFrame that was read from a file with pool=true.

@bkamins
Copy link
Member

bkamins commented Aug 17, 2021

I think it is OK to leave PooledArray when it makes sense. We just need to fix map, which will happen in JuliaData/PooledArrays.jl#71 very soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants