
comment:re sklearn -- integer encoding vs 1-hot (py) #1

Closed
TELSER1 opened this issue Apr 27, 2015 · 12 comments

TELSER1 commented Apr 27, 2015

(Your post popped up in my Twitter feed.)
I'm not sure why you said you needed to one-hot encode categorical variables for scikit-learn's random forest; I'm fairly certain you do not need to (and probably shouldn't). It's been a while since I looked at the source, but I'm pretty sure it handles categorical variables encoded as a single vector of integers just fine, and from empirical tests, performance is almost always worse when the features are one-hot encoded.
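
For concreteness, a minimal sketch of the two encodings being discussed; the toy data and column names here are made up for illustration, not taken from the benchmark:

```python
# Sketch: integer encoding vs one-hot encoding for scikit-learn's random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "carrier": ["AA", "UA", "AA", "DL", "UA", "DL"],
    "origin":  ["ORD", "SFO", "ORD", "ATL", "SFO", "ATL"],
    "delayed": [1, 0, 1, 0, 1, 0],
})
y = df["delayed"]

# Integer encoding: each category level becomes an arbitrary integer code.
X_int = df[["carrier", "origin"]].apply(lambda col: pd.factorize(col)[0])

# One-hot encoding: one 0/1 column per category level.
X_1hot = pd.get_dummies(df[["carrier", "origin"]])

rf = RandomForestClassifier(n_estimators=10).fit(X_int, y)
```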


szilard commented Apr 27, 2015

There is some discussion on this topic here: https://stackoverflow.com/questions/15821751/how-to-use-dummy-variable-to-represent-categorical-data-in-python-scikit-learn-r. So I'm not sure the alternative is better. You are welcome to try it if you'd like by changing https://github.com/szilard/benchm-ml/blob/master/2-rf/2.py, and I would be happy to rerun/time it.

szilard commented May 6, 2015

I tried it out:
1. Generate integer-encoded categoricals: https://gist.github.com/szilard/b2e97062025ac9347f84
2. https://gist.github.com/szilard/56706595b4594e297414
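
A sketch of the kind of timing/AUC measurement behind the numbers below; the file names and the target column are assumptions here, and the actual code is in the gists above:

```python
# Sketch of the benchmark loop: train a random forest on the
# integer-encoded data and report wall-clock time and AUC.
# File names and column layout are assumptions; see the gists for the real code.
import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

d_train = pd.read_csv("train-1m-intenc.csv")   # hypothetical pre-encoded files
d_test  = pd.read_csv("test-intenc.csv")
X_train = d_train.drop("dep_delayed_15min", axis=1)
y_train = d_train["dep_delayed_15min"]
X_test  = d_test.drop("dep_delayed_15min", axis=1)
y_test  = d_test["dep_delayed_15min"]

t0 = time.time()
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)
print("Time: %.0fs" % (time.time() - t0))
print("AUC: %.1f" % (100 * roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))
```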

You get a >10x speedup, a lower memory footprint, and an increase in AUC (n = 1M):

1-hot:   Time: 900s   RAM: 20G   AUC: 73.2
int-enc: Time: 60s    RAM: 15G   AUC: 75.0

I would have expected it to be faster with lower memory usage but a decrease in AUC (or the same in some cases). I think the increase in AUC is because 3 of the variables are actually ordinal (month, day of month, and day of week). I should probably use 1-hot encoding for those 3 to have a fair comparison with the previous results.


szilard commented May 6, 2015

Same with mixed encoding (the above 3 variables 1-hot, the rest integer-encoded, so that I don't gain an accuracy advantage by mapping the ordinal variables to integers):

mixed: Time: 200s   RAM: 16G   AUC: 73.2

This makes sense now: integer encoding vs 1-hot is faster (5x) with a lower memory footprint and the same AUC (though it is still not clear to me when to expect the same AUC and when a lower one, even after this excellent thread: https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg07366.html).
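
A sketch of the mixed encoding; the column names follow the airline dataset used in the benchmark but should be treated as assumptions, and the toy data is for illustration only:

```python
# Sketch of the mixed encoding: keep the ordinal variables one-hot so the
# comparison with the earlier one-hot results stays fair, and integer-encode
# the rest. Column names and data are assumptions.
import pandas as pd

df = pd.DataFrame({
    "Month": [1, 7, 12], "DayofMonth": [3, 15, 28], "DayOfWeek": [1, 4, 6],
    "UniqueCarrier": ["AA", "UA", "DL"],
    "Origin": ["ORD", "SFO", "ATL"], "Dest": ["LAX", "JFK", "ORD"],
})

ordinal_cols = ["Month", "DayofMonth", "DayOfWeek"]   # actually ordinal
nominal_cols = ["UniqueCarrier", "Origin", "Dest"]    # no natural order

X_1hot  = pd.get_dummies(df[ordinal_cols].astype(str))
X_int   = df[nominal_cols].apply(lambda col: pd.factorize(col)[0])
X_mixed = pd.concat([X_1hot, X_int], axis=1)
```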

Any further thoughts @ogrisel @glouppe ?


glouppe commented May 7, 2015

Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.

> I would expect faster, lower memory but decrease in AUC (or same in some cases).

When the categories are ordered, it indeed makes more sense to handle them as numerical variables. I don't have a strong argument as to why it may also be better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer-encoded splits may be less so.


szilard commented May 7, 2015

Thanks @glouppe. I read a paper somewhere that, AFAIR, suggested sorting the (non-ordered) categoricals by their frequency in the data and encoding them as integers in that order. Any idea what that paper might be?


glouppe commented May 7, 2015

Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variable) and runs in linear time.

See section 3.6.3.2 of my thesis if you don't have the CART book:
http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf
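
A sketch of that trick for a binary target: order the categories of a nominal variable by their mean response and use the rank as the integer code, so that a single numeric split can recover the optimal subset split. The variable names here are illustrative:

```python
# Sketch of the Breiman/CART trick for a binary target: rank the
# categories by their mean response and use that rank as the code.
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "c", "a", "b"],
    "y":    [ 1,   0,   1,   0,   0,   1,   1,   0 ],
})

# Mean response per category, then rank the categories by it.
order = df.groupby("city")["y"].mean().sort_values().index
code = {cat: i for i, cat in enumerate(order)}
df["city_enc"] = df["city"].map(code)
print(df[["city", "city_enc", "y"]])
```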


szilard commented May 7, 2015

Great, thanks :)


tqchen commented May 7, 2015

One-hot encoding can be helpful when the number of categories is small (on the order of 10 to 100). In such cases, one-hot encoding can discover interesting interactions such as (gender = male) AND (job = teacher).

Ordering the categories instead makes such interactions harder to discover (it requires two splits on job). However, there is indeed no unified way of handling categorical features in trees, and what trees are really good at is ordered continuous features anyway.
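
A small runnable illustration of that point on toy data (everything here is made up for the demo): the target is the interaction (gender = male) AND (job = teacher). A depth-2 tree finds it with one-hot encoding, but with an integer encoding that puts "teacher" in the middle of the ordering, isolating it needs two splits on job, so a depth-2 tree falls short:

```python
# Toy demo: one-hot lets a depth-2 tree isolate the interaction
# (one split on job_teacher, one on gender); with integer encoding,
# "teacher" sits mid-ordering and cannot be isolated in one split.
import itertools
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

jobs = ["clerk", "nurse", "teacher", "driver", "chef"]  # teacher -> code 2
rows = [(g, j) for g, j in itertools.product(["male", "female"], jobs)] * 20
df = pd.DataFrame(rows, columns=["gender", "job"])
y = ((df["gender"] == "male") & (df["job"] == "teacher")).astype(int)

X_int = pd.DataFrame({
    "gender": (df["gender"] == "male").astype(int),
    "job": df["job"].map({j: i for i, j in enumerate(jobs)}),
})
X_1hot = pd.get_dummies(df)

for name, X in [("int-enc", X_int), ("1-hot", X_1hot)]:
    acc = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).score(X, y)
    print("%s depth-2 accuracy: %.2f" % (name, acc))
# Expected on this toy data: 1-hot reaches 1.00, int-enc tops out around 0.90.
```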


szilard commented May 7, 2015

Thanks @tqchen for the comments.


szilard commented May 12, 2015

For n=10M, the results with integer encoding (along with the previous n=1M result):

n=10M: Time: 1500s   RAM: 120G   AUC: 78.3
n=1M:  Time: 60s     RAM: 15G    AUC: 75.0

wenbo5565 commented

@tqchen Hi Tianqi, would you please explain more about why trees are better at ordered continuous features than discrete ones? Is it because there are many more split points for continuous features? Thanks.
