recode can be too slow #343
Comments
The code has been optimized for a relatively small number of levels. Maybe this could be improved, but I'm not sure
Here is my use case: I have categorical features that I use for ML, and some categories are rare, so I merge them into a "None" category. I use categorical arrays because MLJ's API expects categorical features as categorical arrays. Yes, I could use a plain Vector{String} until I pass the features to MLJ, but that special treatment makes things less convenient.
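A minimal sketch of the use case described above, for readers who want something concrete to run. The toy data, the `min_count` threshold, and the variable names are illustrative assumptions and do not come from the thread; only `categorical` and `recode` are CategoricalArrays functions.

```julia
using CategoricalArrays

# Toy data standing in for a real feature column (assumption)
raw = ["a", "a", "a", "b", "c", "a", "b", "d"]
min_count = 2  # hypothetical rarity threshold

# Count occurrences of each category on the raw strings
counts = Dict{String,Int}()
for s in raw
    counts[s] = get(counts, s, 0) + 1
end

# Categories seen fewer than `min_count` times are considered rare
rare = [s for (s, c) in counts if c < min_count]

# Merge all rare categories into a single "None" level
x = categorical(raw)
merged = recode(x, rare => "None")
```

Here the rare categories are passed as a plain string vector; values that do not match any key are kept unchanged.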
Maybe a function similar to dplyr's That said, there is probably quite a bit of low-hanging fruit. Have you tried with master? Performance should already be better. You could try changing this line to do something like CategoricalArrays.jl/src/recode.jl Line 182 in b2177ab
This makes perfect sense; such a function would fit my use case perfectly. I've prepared some more representative data (attaching CSV files). Here are the results from the latest released version, 0.9.5: This is the result for master (by the way, the version of CategoricalArrays in master is 0.9.3, which doesn't make much sense) Thus
I've made the change you suggested, but it doesn't seem to help much, maybe just a tiny bit: The data used in the test:
OK. Some profiling is needed to check where most of the time is spent. (Could you copy/paste code instead of screenshots? That's much easier to reproduce.)
Actually the most important line is probably this one: CategoricalArrays.jl/src/recode.jl Line 229 in b2177ab
On master, you could try doing simply
Sure
That makes a huge difference:
Cool. So maybe we could always create a
In the following screenshot I compare the performance of
recode(cat_data_vec, cat_vec=>"None")
for two cases:
a) cat_vec is a categorical array
b) cat_vec is a string array
It turns out that case a) is significantly slower.
It is not easy to reproduce the real scale of the performance difference on artificial data, but on real data (300k records and 30k unique categories) case a) is about 10 times slower (40 seconds) than case b) (4 seconds).
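Since the screenshots are not reproduced here, the comparison can be sketched in copy-pasteable form as follows. The data is synthetic and the sizes (`n_levels`, `n_rows`) and the choice of which levels to merge are assumptions chosen to keep the run short; as noted above, artificial data may not show the full gap seen on the real data set.

```julia
using CategoricalArrays

# Synthetic data; sizes are assumptions, far smaller than the real data set
n_levels = 2_000
n_rows = 20_000
lvls = ["level_$i" for i in 1:n_levels]
cat_data_vec = categorical(rand(lvls, n_rows))

# Categories to merge: every other level (arbitrary choice for illustration)
str_vec = lvls[1:2:end]          # case b) keys as a string array
cat_vec = categorical(str_vec)   # case a) keys as a categorical array

# Warm-up calls so that compilation time is not measured
recode(cat_data_vec, cat_vec => "None")
recode(cat_data_vec, str_vec => "None")

# Time both variants
t_cat = @elapsed res_a = recode(cat_data_vec, cat_vec => "None")
t_str = @elapsed res_b = recode(cat_data_vec, str_vec => "None")

println("a) categorical keys: ", t_cat, " s")
println("b) string keys:      ", t_str, " s")
```

The timings depend on the installed CategoricalArrays version and on the machine, so no expected numbers are given.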