Kaggle Notebook 🡪 https://www.kaggle.com/code/wojteksy/santander-hybrid-recommendation-system
The purpose of this project is to predict the sale price of houses. First, I perform exploratory data analysis, which shows me how to prepare the data. To automate the workflow, I build pipelines with custom transformers. After bundling all the pipelines, I tune hyperparameters and finally compute the score.
- Python
- Scikit-Learn
- Pandas
- Seaborn
- Matplotlib
- NumPy
- SciPy
- category_encoders
- xgboost
- Jupyter Notebook
The dataset comes from the Kaggle competition on housing prices in Ames, Iowa.
https://www.kaggle.com/competitions/home-data-for-ml-course/data
I explored the data through univariate and bivariate analysis.
Highly skewed variables will be normalized during feature engineering.
A few outliers need to be removed.
Highly correlated variables will be dropped in feature engineering to avoid multicollinearity.
The columns 'PoolQC', 'MiscFeature' and 'Alley' have too many missing values. They will be dropped in feature engineering (a quick check of these findings is sketched below).
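A minimal sketch of the checks behind these observations, assuming the competition's train.csv has been loaded into a DataFrame called `df` (the variable name is my assumption):

```python
import pandas as pd

df = pd.read_csv('train.csv')

# Skewness of numeric features (candidates for log-transformation).
skewness = df.select_dtypes(exclude='object').skew().sort_values(ascending=False)
print(skewness.head(10))

# Missing values per column (PoolQC, MiscFeature and Alley top this list).
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# Correlation matrix of numeric features, used to spot highly correlated pairs
# (multicollinearity candidates) and the features most related to SalePrice.
corr = df.select_dtypes(exclude='object').corr()
print(corr['SalePrice'].sort_values(ascending=False).head(10))
```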
In this section, I remove outliers as well as columns that have too many missing values or could lead to multicollinearity.
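A minimal sketch of this cleaning step, again assuming the data is in a DataFrame `df`; the exact outlier rule below is my assumption (dropping a few very large GrLivArea observations is a common choice for this dataset), not necessarily the notebook's:

```python
# Drop columns with too many missing values (identified during EDA).
df = df.drop(columns=['PoolQC', 'MiscFeature', 'Alley'])

# Remove a handful of outliers; the threshold is illustrative
# (unusually large living areas with comparatively low sale prices).
df = df[~((df['GrLivArea'] > 4000) & (df['SalePrice'] < 300000))]
```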
I divided this section into numerical and categorical columns. All transformers I created can be plugged in as pipeline steps.
- CustomImputer
- AddAttributes
- DropCorrFeatures
- SkewedFeatures
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    """Wraps an sklearn imputer class and returns a DataFrame instead of an array."""
    def __init__(self, imputer, strategy, fill_value=0):
        self.imputer = imputer
        self.strategy = strategy
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # Instantiate the imputer class here so the transformer can be re-fitted safely.
        self.imputer_ = self.imputer(strategy=self.strategy, fill_value=self.fill_value)
        self.imputer_.fit(X, y)
        return self

    def transform(self, X):
        X_imp_tran = self.imputer_.transform(X)
        X_imputed = pd.DataFrame(X_imp_tran, index=X.index, columns=X.columns)
        return X_imputed
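A quick usage illustration (mine, not from the notebook): wrapping SimpleImputer so the median-imputed output keeps its index and column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame with missing values.
X = pd.DataFrame({'LotArea': [8450.0, np.nan, 11250.0],
                  'GrLivArea': [1710.0, 1262.0, np.nan]})

imputer = CustomImputer(SimpleImputer, strategy='median')
print(imputer.fit_transform(X))  # still a DataFrame with the original columns
```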
import numpy as np
from scipy.stats import skew
from sklearn.base import BaseEstimator, TransformerMixin

class SkewedFeatures(BaseEstimator, TransformerMixin):
    """Log-transforms numeric features whose absolute skewness exceeds a threshold."""
    def __init__(self, skew_threshold=0.8):
        self.skew_threshold = skew_threshold

    def fit(self, X, y=None):
        # Skewness is computed on numeric columns only (runs after imputation in the pipeline).
        skew_features = X.select_dtypes(exclude='object').apply(lambda x: skew(x))
        self.skew_features_high = skew_features[abs(skew_features) > self.skew_threshold].index
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X[self.skew_features_high] = np.log1p(X[self.skew_features_high])
        return X
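AddAttributes and DropCorrFeatures are defined in the notebook but not shown here. As a rough illustration of the latter, a correlation-dropping transformer could look like the sketch below; the 0.8 threshold and the keep/drop rule are my assumptions, not necessarily the notebook's:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DropCorrFeatures(BaseEstimator, TransformerMixin):
    """Drops one feature out of every highly correlated pair to reduce multicollinearity."""
    def __init__(self, corr_threshold=0.8):
        self.corr_threshold = corr_threshold

    def fit(self, X, y=None):
        corr = X.corr().abs()
        # Keep only the upper triangle so each pair is considered once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        self.to_drop_ = [col for col in upper.columns
                         if (upper[col] > self.corr_threshold).any()]
        return self

    def transform(self, X):
        return X.drop(columns=self.to_drop_)
```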
- CustomImputer
- OrdinalEncoder
CentralAir_map = {'Y': 1, 'N': 0}
Street_map = {'Pave': 1, 'Grvl': 0}

# Build the mapping list in the format expected by category_encoders' OrdinalEncoder.
binary_mapping = [{'col': col, 'mapping': globals()[col + '_map']}
                  for col in cat_bin]
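The `ordinal_mapping` used later for quality-like columns is built in the same format. A rough illustration of its shape (the exact columns and scales are defined in the notebook; the values below are my assumption):

```python
# Illustrative only: one ordered scale shared by a few quality columns.
quality_scale = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
ordinal_mapping = [{'col': col, 'mapping': quality_scale}
                   for col in ['ExterQual', 'KitchenQual', 'BsmtQual']]
```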
The most important and satisfying section: pipelines that combine all of the data preprocessing!
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
import category_encoders as ce

# Preprocessing for numerical data
num_transformer = Pipeline(steps=[
    ('num_imputer', CustomImputer(SimpleImputer, strategy='median')),
    ('adder', AddAttributes()),
    ('drop_corr', DropCorrFeatures()),
    ('skew_func', SkewedFeatures()),
    ('std_scaler', StandardScaler())
])
# Preprocessing for categorical data
cat_transformer_ordinal = Pipeline(steps=[
    ('cat_ordinal_imputer', CustomImputer(SimpleImputer, strategy='constant', fill_value='NA')),
    ('ordinal_encoder', ce.OrdinalEncoder(mapping=ordinal_mapping))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_col_transform),
        ('cat_ordinal', cat_transformer_ordinal, cat_ordinal),
        ('cat_ordinal_num', cat_transformer_ordinal_num, cat_ordinal_num),
        ('cat_bin', cat_transformer_bin, cat_bin),
        ('cat_nominal', cat_transformer_nominal, cat_nominal)
    ], remainder='passthrough')
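The preprocessor can then be bundled with the model into a single pipeline. A minimal sketch of that step (the column lists and the remaining categorical transformers are defined in the notebook; the XGBRegressor settings and the X_train/y_train/X_valid names are placeholders):

```python
from xgboost import XGBRegressor

# Full pipeline: preprocessing + model, so fit/predict run everything end to end.
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(random_state=0))
])

model_pipeline.fit(X_train, y_train)          # X_train/y_train: the split training data
predictions = model_pipeline.predict(X_valid)  # X_valid: the held-out validation data
```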
I use XGBRegressor with hyperparameter tuning to evaluate the score.
There is still room to improve the score by tuning more hyperparameters.
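A hedged sketch of how such tuning could look with GridSearchCV on the bundled pipeline from above; the parameter grid is illustrative, not the one used in the notebook:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the notebook's actual search space may differ.
param_grid = {
    'model__n_estimators': [500, 1000],
    'model__learning_rate': [0.03, 0.05],
    'model__max_depth': [3, 4]
}

search = GridSearchCV(model_pipeline, param_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(-search.best_score_)  # mean absolute error of the best candidate
```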