Do not pool values by default? #822
Comments
Pooling values by default can have a lot of advantages when doing grouping/joining operations in DataFrames, which is why it is turned on by default. The specific issue you're seeing is that
I think using PooledArrays.jl by default makes sense. If you have large files then it saves a lot of memory in practice and speeds up operations. You can always disable this. The crucial issue is how
In DataFrames.jl one crucial line where we use
which now uses the
I do kind of like the idea of having a
AFAIK no real-world example where
I sometimes
A warning somewhere would be nice. I've spent several hours trying to debug bizarre behavior on DataFrame operations and then realize it is because I am using
I think it is OK to leave
Hi,
From the documentation it seems values are pooled (using `PooledArrays`) by default when there are many repeated values. However, the behavior of pooled arrays is very different from standard arrays (see example below), and this leads to complications that are hard to predict, as the resulting data type depends on the input structure.

Switching strings to categorical data by default was previously discarded for similar reasons, so would it make sense to set `pool=0` by default? If not, maybe a warning when the data is pooled could help users track down the issue if they run into it?

Cheers.
This was done with Julia 1.5, CSV v0.8.4, DataFrames v0.22.5 and PooledArrays v1.2.1.
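
For readers who want to check which kind of column they ended up with, here is a minimal sketch (not the example from this report) that reads the same data with pooling forced on and off via the `pool` keyword of `CSV.read`. The inline data and the column names `id` and `city` are made up for illustration, and `pool=false` is used as the "off" setting on the assumption that it has the effect intended by `pool=0` above; the exact printed types may differ between CSV.jl versions.

```julia
using CSV, DataFrames

# Hypothetical inline data; the column names are illustrative only.
data = """
id,city
1,Paris
2,Paris
3,Lyon
4,Paris
"""

# Force pooling of string columns, regardless of how many values repeat.
pooled = CSV.read(IOBuffer(data), DataFrame; pool=true)

# Disable pooling entirely.
plain = CSV.read(IOBuffer(data), DataFrame; pool=false)

# Inspect the resulting column types.
println(typeof(pooled.city))
println(typeof(plain.city))

# Materialize a pooled column as an ordinary Vector when standard
# array behavior is needed downstream.
unpooled = collect(pooled.city)
println(typeof(unpooled))
```

Calling `collect` on an already loaded column is a workaround for cases where re-reading the file is not convenient; passing `pool=false` at read time avoids the pooled type from the start.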