recode can be too slow #343

pgagarinov · 2021-04-12T14:28:59Z

In the following screenshot I compare the performance of recode(cat_data_vec, cat_vec=>"None") for two cases:
a) cat_vec is a categorical array
b) cat_vec is a string array

It turns out that a) is significantly slower.

It is not easy to reproduce the real scale of the difference on artificial data, but on real data (300k records and 30k unique categories) case a) is actually 10 times slower (40 seconds) than b) (4 seconds).


nalimilan · 2021-04-15T20:29:03Z

The code has been optimized for a relatively small number of levels. Maybe this could be improved, but I'm not sure recode is the best tool for this task.

pgagarinov · 2021-04-15T20:52:03Z

> The code has been optimized for a relatively small number of levels. Maybe this could be improved, but I'm not sure recode is the best tool for this task.

Here is my use case: I have categorical features that I use for ML, and some categories are rare, so I merge them into a "None" category. I use categorical arrays because MLJ's API expects categorical features as categorical arrays. Yes, I could use a plain Vector{String} until I pass the features to MLJ, but this special treatment makes things less convenient.
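A minimal sketch of this use case, for readers who want something concrete. The toy data, the `min_count` threshold, and the manual counting loop are assumptions made for illustration; only `categorical`, `unwrap`, and `recode` come from CategoricalArrays.

```julia
using CategoricalArrays

# Toy data (hypothetical): "c" and "d" each occur only once.
x = categorical(["a", "a", "b", "c", "a", "b", "d"])

# Count how often each value occurs, using only Base.
counts = Dict{String,Int}()
for v in x
    key = unwrap(v)                      # CategoricalValue -> String
    counts[key] = get(counts, key, 0) + 1
end

min_count = 2                            # hypothetical rarity threshold
rare = Set(k for (k, n) in counts if n < min_count)

# Merge every rare category into a single "None" category.
merged = recode(x, rare => "None")
```

The result is still a categorical array, so it can be passed on to code that expects categorical features.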

nalimilan · 2021-04-15T21:01:39Z

Maybe a function similar to dplyr's fct_collapse would be more appropriate, and simpler to optimize.

That said, there is probably quite a bit of low-hanging fruit. Have you tried with master? Performance should already be better. You could try changing this line to do something like sfirst = Set(first); recode_in(l, sfirst):

CategoricalArrays.jl/src/recode.jl

Line 182 in b2177ab

any(f -> recode_in(l, f), firsts))
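To illustrate why a Set might help here, the following sketch contrasts a linear scan over a vector with a hash lookup in a Set. It uses only Base Julia; `levels_to_merge`, `in_vector`, and `in_set` are hypothetical names chosen for this example and are not part of recode.jl.

```julia
# Hypothetical stand-in for the values on the left-hand side of a recode pair.
levels_to_merge = ["cat_$i" for i in 1:30_000]

# Building the Set costs one pass over the vector, done once up front.
merge_set = Set(levels_to_merge)

# Linear scan: every lookup may walk the whole vector.
in_vector(l, v) = any(x -> isequal(l, x), v)

# Hash lookup: roughly constant time per query.
in_set(l, s) = l in s

# Both approaches agree on membership; only the cost per lookup differs.
probe = "cat_29999"
both_agree = in_vector(probe, levels_to_merge) == in_set(probe, merge_set)
```

With many levels to check against many candidate values, the scan does work proportional to the product of the two sizes, while the Set version does work roughly proportional to their sum.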

pgagarinov · 2021-04-17T14:52:25Z

@nalimilan

> Maybe a function similar to dplyr's fct_collapse would be more appropriate, and simpler to optimize.

This makes perfect sense; such a function would fit my use case perfectly.

I've prepared some more representative data (attaching csv files).

Here are the results from the latest released version 0.9.5:

This is the result for master (by the way, the version of CategoricalArrays in master is 0.9.3, which doesn't make much sense):

Thus recode on master is almost 2 times faster than in 0.9.5, but it is still at least two times slower when the values to recode are provided as a categorical array rather than as a vector of strings.

> You could try changing this line to do something like sfirst = Set(first); recode_in(l, sfirst):

I've made the change you suggested, but it doesn't seem to help much, maybe just a tiny bit:

The data used in the test:
recode_performance_test_data.zip

nalimilan · 2021-04-17T21:12:15Z

OK. Some profiling is needed to check where most of the time is spent.

(Could you copy/paste code instead of screenshots? That's much easier to reproduce.)

nalimilan · 2021-04-17T21:14:32Z

Actually the most important line is probably this one:

CategoricalArrays.jl/src/recode.jl

Line 229 in b2177ab

if l ≅ p.first || recode_in(l, p.first)

On master, you could try doing simply recode(orig_vec, Set(cat2merge_vec) => "None"), that should have the same effect as modifying the function.

pgagarinov · 2021-04-17T21:17:11Z

> Could you copy/paste code instead of screenshots? That's much easier to reproduce.

Sure

using CategoricalArrays
using DelimitedFiles

# Load the attached test data
orig_vec = readdlm("orig_vec.csv")
cat2merge_vec = readdlm("cat2rep_vec.csv")
orig_vec = categorical(orig_vec)
cat2merge_vec = categorical(cat2merge_vec)

# Each call is timed twice because the first run includes compilation
# a) categories to merge given as a categorical array
@time recode(orig_vec, cat2merge_vec=>"None");
@time recode(orig_vec, cat2merge_vec=>"None");
# b) categories to merge given as strings
@time recode(orig_vec, string.(cat2merge_vec)=>"None");
@time recode(orig_vec, string.(cat2merge_vec)=>"None");

pgagarinov · 2021-04-17T21:20:02Z

> Actually the most important line is probably this one:
>
> CategoricalArrays.jl/src/recode.jl
>
> Line 229 in b2177ab
>
> if l ≅ p.first || recode_in(l, p.first)
>
> On master, you could try doing simply recode(orig_vec, Set(cat2merge_vec) => "None"), that should have the same effect as modifying the function.

That makes a huge difference:

julia> @time recode(orig_vec, Set(cat2merge_vec) => "None");
  0.015400 seconds (29.54 k allocations: 3.789 MiB)

nalimilan · 2021-04-17T21:29:46Z

Cool. So maybe we could always create a Set when first is a vector. Would you make a pull request?
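One way to picture this idea is a small preprocessing step that rewrites each pair before the main loop runs. The sketch below is only an illustration in Base Julia: `setify_pair` is a hypothetical helper name, and this is not presented as the code that was eventually merged.

```julia
# Hypothetical helper: leave ordinary pairs untouched...
setify_pair(p::Pair) = p

# ...but when the left-hand side is an array, convert it to a Set once,
# so that later membership checks are hash lookups instead of scans.
setify_pair(p::Pair{<:AbstractArray}) = Set(p.first) => p.second

scalar_pair = setify_pair("a" => "b")
array_pair = setify_pair(["x", "y", "z"] => "None")
```

Dispatching on the type of the pair's first element keeps the scalar case free of any extra cost, and callers who already pass a Set are unaffected.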


pgagarinov mentioned this issue Apr 19, 2021

Optimize recode for the large number of categories when the categories to be recoded are specified as arrays #345

Merged

pgagarinov pushed a commit to pgagarinov/CategoricalArrays.jl that referenced this issue Apr 27, 2021

Recode is optimized by transforming arrays into sets (solves JuliaDat…

3d41f2b

…a#343).

nalimilan closed this as completed May 4, 2021
