Adaptive K-medoid discretizer for numerical feature engineering.

kmedoid-discretizer (Adaptive K-medoid discretizer) discretizes numerical features into n_bins using the K-medoids clustering algorithm, with a scikit-learn-compatible API (an alternative to sklearn's KBinsDiscretizer).

With this implementation, you can:
- Set a custom number of bins for each numerical feature; K-medoids is run on each column independently.
- Have the number of bins adapted dynamically whenever it is too high (more precisely, when two centroids are assigned to the same data point).
- Speed up the K-medoids computation with one of several backends: serial, multiprocessing, or Ray (see the sketch below).
- Work mainly with pandas DataFrames and NumPy arrays.
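For instance, the per-column bin counts and the backend are both set in the constructor. A minimal sketch with toy data, reusing only parameters (`n_bins`, `backend`, `seed`) that appear in the examples below:

```python
import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Toy frame with two numerical columns.
X = pd.DataFrame({"age": [22, 35, 35, 58], "fare": [7.2, 8.0, 71.3, 8.0]})

# One bin count per column; K-medoids runs independently on each column.
discretizer = KmedoidDiscretizer(n_bins=[2, 3], backend="serial", seed=0)
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
```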
Install from GitHub:

```bash
pip install git+ssh://git@github.com/Vic-ai/kmedoid-discretizer.git
```
For development:
- Clone the repository: `git clone git@github.com:Vic-ai/kmedoid-discretizer.git`
- Download Poetry: `make poetry-download` (see the Poetry docs: https://python-poetry.org/docs/)
- Install the dev requirements into a virtualenv: `make install`
Here is a basic use case. First, some toy data:
```python
import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Fake training set
X = pd.DataFrame.from_dict({"feature": [1, 2, 2, 3]})
# Fake testing set
X_test = pd.DataFrame.from_dict({"feature": [0, 2, 5]})

discretizer = KmedoidDiscretizer(2)

# Discretize X into 2 bins: 1 and 2 go to bin 0, 3 to bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)

# Discretize X_test with the fitted bins: 0 and 2 go to bin 0, 5 to bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
```
Output:

```
   feature
0        0
1        0
2        0
3        1
```

```
   feature
0        0
1        0
2        1
```
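Per the adaptive behavior listed in the features, requesting more bins than the data supports should trigger an automatic reduction. A minimal sketch of the expected behavior (the exact reduced count depends on the fit):

```python
# X holds only three distinct values (1, 2, 3), so 4 bins is too many:
# two medoids would have to land on the same data point.
discretizer = KmedoidDiscretizer(4)
X_discrete = discretizer.fit_transform(X)

# Expect fewer than 4 distinct bins after the automatic reduction.
print(X_discrete["feature"].nunique())
```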
The same example with one-hot dense encoding (note the `encode` parameter, matching the pipeline example below):

```python
discretizer = KmedoidDiscretizer(2, encode="onehot-dense")

# Discretize X into 2 bins: 1 and 2 go to bin 0, 3 to bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)

# Discretize X_test with the fitted bins: 0 and 2 go to bin 0, 5 to bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
```
Output:

```
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  1.0  0.0
3      3  0.0  1.0
```

```
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  0.0  1.0
```
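Because K-medoids is fit per column, frames with many columns can benefit from the multiprocessing or Ray backends named in the features above. A minimal sketch with the multiprocessing backend (Ray must be installed separately for `backend="ray"`):

```python
X_wide = pd.DataFrame({
    "a": [1, 2, 2, 3, 9],
    "b": [0.1, 0.4, 0.4, 0.9, 2.5],
})

# Fit each column's K-medoids in parallel worker processes.
discretizer = KmedoidDiscretizer(
    n_bins=[2, 3],
    encode="onehot-dense",
    backend="multiprocessing",
    seed=0,
)
X_wide_discrete = discretizer.fit_transform(X_wide)
print(X_wide_discrete)
```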
A complete scikit-learn pipeline on the Titanic dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from kmedoid_discretizer.discretizer import KmedoidDiscretizer
from kmedoid_discretizer.utils.utils_external import PandasSimpleImputer

np.random.seed(0)

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

cat_features = ["pclass", "sex"]
num_features = ["age", "fare", "sibsp", "parch"]  # The ones we will discretize

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Numerical transformer pipeline
numeric_transformer = Pipeline(
    steps=[
        ("imputer", PandasSimpleImputer(strategy="median")),
        ("discretizer", KmedoidDiscretizer(
            n_bins=[8, 5, 7, 7],
            encode="onehot-dense",
            backend="serial",
            verbose=True,
            seed=0,
        )),
    ]
)

# Categorical transformer pipeline
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder()),
    ]
)

# Combine the numerical and categorical transformers
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

# Overall pipeline: preprocessor + classifier
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

clf.fit(X_train, y_train)
print("Train score: %.3f" % clf.score(X_train, y_train))
print("Test score: %.3f" % clf.score(X_test, y_test))
```
```
Train score: 0.802
Test score: 0.809
```
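Since the discretizer is scikit-learn compatible, n_bins can be tuned like any other pipeline hyperparameter. A hedged sketch, assuming the estimator exposes its constructor arguments through the standard get_params/set_params convention (as its sklearn compatibility suggests):

```python
from sklearn.model_selection import GridSearchCV

# The parameter path follows the step names defined above:
# pipeline "preprocessor" -> ColumnTransformer "num" -> step "discretizer".
param_grid = {
    "preprocessor__num__discretizer__n_bins": [[8, 5, 7, 7], [4, 4, 4, 4]],
}
search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score: %.3f" % search.best_score_)
```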
Contributors: Marvin Martin, Daniel Nowak

MIT License © Vic.ai 2023