Do not pool values by default? #822

koalive · 2021-04-08T13:07:59Z

Hi,

From the documentation it seems values are pooled (using PooledArrays) by default when there are many repeated values. However the behavior of pooled arrays is very different from standard arrays (see example below) and this leads to complications that are hard to predict as the resulting data type depends on the input structure.
Switching strings to categorical data by default was previously discarded for similar reasons, so would it make sense to set pool=0 by default? If not, maybe a warning when the data is pooled could help users to track the issue if they run into it?

Cheers.

using CSV, DataFrames

X = DataFrame(x = rand(["A", "B", "C"], 200));

CSV.write("test.csv", X)

Y = CSV.read("test.csv", DataFrame)

typeof(X.x)
# Array{String,1}

typeof(Y.x)
# PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}

map(println, X.x);
# B
# A
# C
# B
# A
# ...

map(println, Y.x);
# B
# A
# C

This was done with Julia 1.5, CSV v0.8.4, DataFrames v0.22.5 and PooledArrays v1.2.1.

The text was updated successfully, but these errors were encountered:

quinnj · 2021-04-13T05:02:44Z

Pooling values by default can have a lot of advantages when doing grouping/joining operations in DataFrames, which is why it is turned on by default. The specific issue you're seeing is that map on a PooledArray assumes the input function is pure, meaning no side effects, and thus is called only on the underlying pool itself instead of the entire array. There's an open issue currently discussing whether that behavior should be kept or at least better documented: JuliaData/PooledArrays.jl#63.

cc: @bkamins @oxinabox

bkamins · 2021-04-13T09:55:41Z

I think using PooledArrays.jl by default makes sense. If you have large files then it saves a lot of memory in practice and speeds up operations. You can always disable this.

The crucial issue is how map should work. As you can see in the linked threads it was my concern. Maybe having map and pure_map differentiated would be the best approach here?

bkamins · 2021-04-13T09:59:55Z

In DataFrames.jl one crucial line where we use map that is impacted would be:

https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/selection.jl#L171

which now uses the map feature we discuss here. Here is an example:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=repeat(1:10, 10^6));

julia> df.y = PooledArray(df.x);

julia> combine(df, :x => ByRow(abs));

julia> @time combine(df, :x => ByRow(abs));
  0.030409 seconds (94 allocations: 76.299 MiB, 15.49% gc time)

julia> combine(df, :y => ByRow(abs));

julia> @time combine(df, :y => ByRow(abs));
  0.012936 seconds (115 allocations: 38.154 MiB)

quinnj · 2021-04-13T15:04:03Z

I do kind of like the idea of having a DataAPI.puremap(f, x) = map(f, x) that we could overload in PooledArrays.jl to do the faster thing (and use in DataFrames.jl).

nalimilan · 2021-04-15T20:53:41Z

AFAIK no real-world example where map purity is a problem has been reported yet so I'm not sure it's a problem in practice.

bkamins · 2021-04-15T21:51:59Z

I sometimes println for debugging :).

hdavid16 · 2021-08-17T20:15:11Z

A warning somewhere would be nice. I've spent several hours trying to debug bizarre behavior on DataFrame operations and then realize it is because I am using map on a DataFrame that was read from a file with pool=true.

bkamins · 2021-08-17T20:41:09Z

I think it is OK to leave PooledArray when it makes sense. We just need to fix map, which will happen in JuliaData/PooledArrays.jl#71 very soon.

quinnj closed this as completed Apr 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not pool values by default? #822

Do not pool values by default? #822

koalive commented Apr 8, 2021

quinnj commented Apr 13, 2021

bkamins commented Apr 13, 2021

bkamins commented Apr 13, 2021

quinnj commented Apr 13, 2021

nalimilan commented Apr 15, 2021

bkamins commented Apr 15, 2021

hdavid16 commented Aug 17, 2021

bkamins commented Aug 17, 2021

Do not pool values by default? #822

Do not pool values by default? #822

Comments

koalive commented Apr 8, 2021

quinnj commented Apr 13, 2021

bkamins commented Apr 13, 2021

bkamins commented Apr 13, 2021

quinnj commented Apr 13, 2021

nalimilan commented Apr 15, 2021

bkamins commented Apr 15, 2021

hdavid16 commented Aug 17, 2021

bkamins commented Aug 17, 2021