Adaptive K-medoid discretizer for numerical feature engineering.

kmedoid-discretizer (Adaptive K-medoid discretizer) discretizes numerical features into n_bins using the K-medoids clustering algorithm, with a scikit-learn-compatible API (an alternative to sklearn's KBinsDiscretizer).

With this implementation, you can:
- Set a custom number of bins for each numerical feature; K-medoids is run on each column independently.
- Have the number of bins adapted dynamically whenever it is too high (more precisely, when two centroids are assigned to the same data point).
- Speed up the K-medoids computation with one of several backends: serial, multiprocessing, or Ray (see the sketch below).
- Work mainly with pandas DataFrames and NumPy arrays.
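For instance, the per-column bin counts and the backend are both set in the constructor. A minimal sketch with toy data, reusing only parameters (`n_bins`, `backend`, `seed`) that appear in the examples below:

```python
import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Toy frame with two numerical columns.
X = pd.DataFrame({"age": [22, 35, 35, 58], "fare": [7.2, 8.0, 71.3, 8.0]})

# One bin count per column; K-medoids runs independently on each column.
discretizer = KmedoidDiscretizer(n_bins=[2, 3], backend="serial", seed=0)
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
```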
Install from GitHub:

```bash
pip install git+ssh://git@github.com/Vic-ai/kmedoid-discretizer.git
```
For development:
- Clone the repository: `git clone git@github.com:Vic-ai/kmedoid-discretizer.git`
- Download Poetry: `make poetry-download` (see the Poetry docs: https://python-poetry.org/docs/)
- Install the dev requirements into a virtualenv: `make install`
Here is a basic use case. First, some toy data:
```python
import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Fake training set
X = pd.DataFrame.from_dict({"feature": [1, 2, 2, 3]})
# Fake testing set
X_test = pd.DataFrame.from_dict({"feature": [0, 2, 5]})

discretizer = KmedoidDiscretizer(2)

# Discretize X into 2 bins: 1 and 2 go to bin 0, 3 to bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)

# Discretize X_test with the fitted bins: 0 and 2 go to bin 0, 5 to bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
```
Output:

```
   feature
0        0
1        0
2        0
3        1
```

```
   feature
0        0
1        0
2        1
```
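Per the adaptive behavior listed in the features, requesting more bins than the data supports should trigger an automatic reduction. A minimal sketch of the expected behavior (the exact reduced count depends on the fit):

```python
# X holds only three distinct values (1, 2, 3), so 4 bins is too many:
# two medoids would have to land on the same data point.
discretizer = KmedoidDiscretizer(4)
X_discrete = discretizer.fit_transform(X)

# Expect fewer than 4 distinct bins after the automatic reduction.
print(X_discrete["feature"].nunique())
```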
The same example with one-hot dense encoding (note the `encode` parameter, matching the pipeline example below):

```python
discretizer = KmedoidDiscretizer(2, encode="onehot-dense")

# Discretize X into 2 bins: 1 and 2 go to bin 0, 3 to bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)

# Discretize X_test with the fitted bins: 0 and 2 go to bin 0, 5 to bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
```
Output:

```
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  1.0  0.0
3      3  0.0  1.0
```

```
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  0.0  1.0
```
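Because K-medoids is fit per column, frames with many columns can benefit from the multiprocessing or Ray backends named in the features above. A minimal sketch with the multiprocessing backend (Ray must be installed separately for `backend="ray"`):

```python
X_wide = pd.DataFrame({
    "a": [1, 2, 2, 3, 9],
    "b": [0.1, 0.4, 0.4, 0.9, 2.5],
})

# Fit each column's K-medoids in parallel worker processes.
discretizer = KmedoidDiscretizer(
    n_bins=[2, 3],
    encode="onehot-dense",
    backend="multiprocessing",
    seed=0,
)
X_wide_discrete = discretizer.fit_transform(X_wide)
print(X_wide_discrete)
```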
A complete scikit-learn pipeline on the Titanic dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from kmedoid_discretizer.discretizer import KmedoidDiscretizer
from kmedoid_discretizer.utils.utils_external import PandasSimpleImputer

np.random.seed(0)

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

cat_features = ["pclass", "sex"]
num_features = ["age", "fare", "sibsp", "parch"]  # The ones we will discretize

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Numerical transformer pipeline
numeric_transformer = Pipeline(
    steps=[
        ("imputer", PandasSimpleImputer(strategy="median")),
        ("discretizer", KmedoidDiscretizer(
            n_bins=[8, 5, 7, 7],
            encode="onehot-dense",
            backend="serial",
            verbose=True,
            seed=0,
        )),
    ]
)

# Categorical transformer pipeline
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder()),
    ]
)

# Combine the numerical and categorical transformers
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

# Overall pipeline: preprocessor + classifier
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

clf.fit(X_train, y_train)
print("Train score: %.3f" % clf.score(X_train, y_train))
print("Test score: %.3f" % clf.score(X_test, y_test))
```
```
Train score: 0.802
Test score: 0.809
```
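Since the discretizer is scikit-learn compatible, n_bins can be tuned like any other pipeline hyperparameter. A hedged sketch, assuming the estimator exposes its constructor arguments through the standard get_params/set_params convention (as its sklearn compatibility suggests):

```python
from sklearn.model_selection import GridSearchCV

# The parameter path follows the step names defined above:
# pipeline "preprocessor" -> ColumnTransformer "num" -> step "discretizer".
param_grid = {
    "preprocessor__num__discretizer__n_bins": [[8, 5, 7, 7], [4, 4, 4, 4]],
}
search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score: %.3f" % search.best_score_)
```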
Contributors: Marvin Martin, Daniel Nowak

MIT License © Vic.ai 2023