The objective of this exercise is to practice various steps of data preprocessing and feature engineering.
The scenario is the preparation of data for an ML multilinear regression model.
The dataset used is the "Climate Weather Surface of Brazil - Hourly", which is available on Kaggle.
It contains hourly climate data recorded at weather stations in Brazil between 2000 and 2021.
This exercise is broken down as follows:
Part I
- Load data
- Inspect data
Part II
- Format features
- Clean messy data
- Remove duplicate values
Part III
- Treat missing values
- Imputation
Part IV
- Remove strongly correlated features
- Remove outliers
Part V
- Aggregate features
- Encode categorical features
- Feature scaling
- Dimensionality reduction and feature decomposition
Part VI
- Sample and balance
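
To make a few of the steps above concrete, here is a minimal sketch of duplicate removal, median imputation, and feature scaling on a tiny in-memory sample. It is illustrative only: the records, the field names (station, hour, temp_c), and the helper functions are hypothetical and are not taken from the Kaggle dataset, and it uses only the Python standard library so that it runs anywhere. In the exercise itself a dataframe library would probably be the more practical choice.

```python
from statistics import mean, median, pstdev

# Hypothetical hourly records; field names are illustrative only,
# not the real column names of the Kaggle dataset.
records = [
    {"station": "A001", "hour": "2020-01-01 00:00", "temp_c": 24.0},
    {"station": "A001", "hour": "2020-01-01 00:00", "temp_c": 24.0},  # exact duplicate
    {"station": "A001", "hour": "2020-01-01 01:00", "temp_c": None},  # missing value
    {"station": "A001", "hour": "2020-01-01 02:00", "temp_c": 22.0},
    {"station": "A001", "hour": "2020-01-01 03:00", "temp_c": 26.0},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["station"], row["hour"], row["temp_c"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows, field):
    """Replace missing (None) values of `field` with the median of observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = median(observed)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


unique = drop_duplicates(records)
imputed = impute_median(unique, "temp_c")
scaled = standardize([row["temp_c"] for row in imputed])

print(len(records), "->", len(unique), "records after removing duplicates")
print([row["temp_c"] for row in imputed])
print([round(v, 3) for v in scaled])
```

The median is used here as one simple imputation choice because it is less sensitive to extreme readings than the mean; whether it suits a given feature depends on the data.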