Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recode can be too slow #343

Closed
pgagarinov opened this issue Apr 12, 2021 · 9 comments
Closed

recode can be too slow #343

pgagarinov opened this issue Apr 12, 2021 · 9 comments

Comments

@pgagarinov
Copy link
Contributor

On the following screenshot I compare performance of recode(cat_data_vec, cat_vec=>"None") for 2 cases:
a) cat_vec is a categorical array
b) cat_vec is a string array

It turns out that performance of a) is significantly lower

image

It is not easy to reproduce the real scale of performance differences on artificial data, by on real data (300k records and 30k unique categories) case a) is actually 10 times slower (40 seconds) than b) (4 seconds).

@nalimilan
Copy link
Member

The code has been optimized for a relatively small number of levels. Maybe this could be improved, but I'm not sure recode is the best tool for this task.

@pgagarinov
Copy link
Contributor Author

pgagarinov commented Apr 15, 2021

The code has been optimized for a relatively small number of levels. Maybe this could be improved, but I'm not sure recode is the best tool for this task.

Here is my use case - I have categorical features I use for ML and some categories are rare so I merge them into "None" category. I use categorical arrays because MLJ's API expects categorical features as categorical arrays. Yes, I can use just Vector{String} until I pass features into MLJ but this special treatment makes things less convenient.

@nalimilan
Copy link
Member

Maybe a function similar to dplyr's fct_collapse would be more appropriate, and simpler to optimize.

That said, there are probably quite a few long-hanging fruits. Have you tried with master? Performance should already be better. You could try changing this line to do something like sfirst = Set(first); recode_in(l, sfirst):

any(f -> recode_in(l, f), firsts))

@pgagarinov
Copy link
Contributor Author

pgagarinov commented Apr 17, 2021

@nalimilan

Maybe a function similar to dplyr's fct_collapse would be more appropriate, and simpler to optimize.

This makes a perfect sense, such a function would perfectly fit my use case.

I've prepared some more representative data (attaching csv files).

Here are the results from the latest released version 0.9.5:

image

This is the result for the master (btw, the version of CategoricalArray in master is 0.9.3 which doesn't make much sense)

image

Thus recode in the master is almost 2 time faster than in 0.9.5 but it is still at least two times slower when categorical values to recode are provided as a categorical array as opposed to being provided as a vector of strings.

You could try changing this line to do something like sfirst = Set(first); recode_in(l, sfirst):

I've made the change you've suggested but it doesn't seem to help much, maybe just a tiny bit:

image

The data used in the test:
recode_performance_test_data.zip

@nalimilan
Copy link
Member

OK. Some profiling is needed to check where most of the time is spent.

(Could you copy/paste code instead of screenshots? That's much easier to reproduce.)

@nalimilan
Copy link
Member

Actually the most important line is probably this one:

if l p.first || recode_in(l, p.first)

On master, you could try doing simply recode(orig_vec, Set(cat2merge_vec) => "None"), that should have the same effect as modifying the function.

@pgagarinov
Copy link
Contributor Author

pgagarinov commented Apr 17, 2021

Could you copy/paste code instead of screenshots? That's much easier to reproduce.

Sure

using CategoricalArrays;
using DelimitedFiles;
orig_vec = readdlm("orig_vec.csv");
cat2merge_vec = readdlm("cat2rep_vec.csv");
orig_vec = categorical(orig_vec);
cat2merge_vec = categorical(cat2merge_vec);
@time recode(orig_vec, cat2merge_vec=>"None");
@time recode(orig_vec, cat2merge_vec=>"None");
@time recode(orig_vec, string.(cat2merge_vec)=>"None");
@time recode(orig_vec, string.(cat2merge_vec)=>"None");

@pgagarinov
Copy link
Contributor Author

Actually the most important line is probably this one:

if l p.first || recode_in(l, p.first)

On master, you could try doing simply recode(orig_vec, Set(cat2merge_vec) => "None"), that should have the same effect as modifying the function.

That makes a huge difference:

julia> @time recode(orig_vec, Set(cat2merge_vec) => "None");
  0.015400 seconds (29.54 k allocations: 3.789 MiB)

@nalimilan
Copy link
Member

Cool. So maybe we could always create a Set when first is a vector. Would you make a pull request?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants