Data preparation and transformation exercise

The objective of this exercise is to practice various steps of data preprocessing and feature engineering.

The scenario is the preparation of data for a machine learning (ML) multiple linear regression model.

The dataset used is the "Climate Weather Surface of Brazil - Hourly", which is available on Kaggle.

It contains hourly climate data recorded at weather stations in Brazil between 2000 and 2021.

This exercise is broken down as follows:

Part I

Load data
Inspect data
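
As a rough illustration of Part I, the sketch below loads a CSV with pandas and prints a few basic summaries. The inline sample and its column names (station, temperature_c, humidity_pct) are hypothetical stand-ins, not the actual layout of the Kaggle files; with the real data, a file path would be passed in place of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the dataset's CSV files.
# The column names are assumptions made for illustration only.
SAMPLE_CSV = """date,hour,station,temperature_c,humidity_pct
2020-01-01,00:00,A001,24.1,80
2020-01-01,01:00,A001,23.8,82
2020-01-01,02:00,A001,,85
"""


def load_data(source):
    """Read a CSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_data(io.StringIO(SAMPLE_CSV))

# Basic inspection: size, column types, missing values, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())
```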

Part II

Format features
Clean messy data
Remove duplicate values
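
The Part II steps could look roughly like the sketch below. It assumes, purely for illustration, that dates and hours arrive as separate text columns, that numbers use a decimal comma, and that -9999 marks an invalid reading; none of these conventions is confirmed by the text above, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input; names, formats, and the sentinel are assumptions.
raw = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"],
    "hour": ["00:00", "01:00", "01:00", "00:00"],
    "station": [" a001", "A001 ", "A001 ", "A002"],
    "temperature_c": ["24,1", "23,8", "23,8", "-9999"],
})


def clean(df, sentinel=-9999):
    out = df.copy()
    # Format features: merge date and hour into a single datetime column.
    out["timestamp"] = pd.to_datetime(
        out["date"] + " " + out["hour"], format="%Y-%m-%d %H:%M"
    )
    out = out.drop(columns=["date", "hour"])
    # Clean messy data: normalise station codes and parse decimal commas.
    out["station"] = out["station"].str.strip().str.upper()
    out["temperature_c"] = pd.to_numeric(
        out["temperature_c"].str.replace(",", ".", regex=False), errors="coerce"
    )
    # Treat the assumed sentinel value as a missing reading.
    out["temperature_c"] = out["temperature_c"].replace(sentinel, np.nan)
    # Remove rows that are exact duplicates after cleaning.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Duplicates are dropped after the formatting steps so that rows differing only in whitespace or letter case are recognised as the same record.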

Part III

Treat missing values
Imputation
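
For Part III, one possible approach is sketched below: measure the share of missing values per column, then fill the gaps. Time-based interpolation for temperature and median filling for humidity are illustrative choices, not the methods prescribed by the exercise, and the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with gaps; column names are assumptions.
index = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(6), unit="h")
readings = pd.DataFrame(
    {
        "temperature_c": [20.0, np.nan, 22.0, np.nan, np.nan, 25.0],
        "humidity_pct": [80.0, np.nan, 84.0, 86.0, np.nan, 90.0],
    },
    index=index,
)

# Share of missing values per column, useful for deciding how to treat each one.
missing_share = readings.isna().mean()
print(missing_share)

filled = readings.copy()
# Interpolate along the time axis (requires a DatetimeIndex).
filled["temperature_c"] = filled["temperature_c"].interpolate(method="time")
# Replace remaining gaps in humidity with the column median.
filled["humidity_pct"] = filled["humidity_pct"].fillna(
    filled["humidity_pct"].median()
)
print(filled)
```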

Part IV

Remove strongly correlated features
Remove outliers
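
A minimal sketch of the two Part IV steps follows, using synthetic data. The 0.9 correlation threshold and the 1.5 x IQR rule are common defaults chosen here as assumptions, and the helper names drop_correlated and drop_iqr_outliers are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data; column names are assumptions made for illustration.
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(25, 3, n)
df = pd.DataFrame({
    "temperature_c": temp,
    "dew_point_c": temp - 5 + rng.normal(0, 0.1, n),  # nearly collinear
    "wind_speed_ms": rng.normal(3, 1, n),
})
df.loc[0, "wind_speed_ms"] = 60.0  # one implausible reading


def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


def drop_iqr_outliers(df, k=1.5):
    """Keep rows where every column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    inside = ((df >= q1 - k * iqr) & (df <= q3 + k * iqr)).all(axis=1)
    return df[inside]


reduced, dropped = drop_correlated(df)
result = drop_iqr_outliers(reduced)
print(dropped, len(df), len(result))
```

Strongly correlated predictors matter for multiple linear regression in particular, because collinearity makes the fitted coefficients unstable.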

Part V

Aggregate features
Encode categorical features
Feature scaling
Dimensionality reduction and feature decomposition
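
The Part V steps can be sketched end to end with pandas and NumPy, as below. The data is synthetic, the column names are hypothetical, and the choices of daily means, one-hot encoding, z-score scaling, and PCA via SVD are illustrative assumptions; in practice a library such as scikit-learn (StandardScaler, PCA) would typically be used, with scaling statistics computed on the training split only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for two stations over four days (names are assumptions).
rng = np.random.default_rng(1)
hours = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(96), unit="h")
hourly = pd.DataFrame({
    "timestamp": list(hours) * 2,
    "station": np.repeat(["A001", "A002"], 96),
    "temperature_c": rng.normal(25, 3, 192),
    "humidity_pct": rng.normal(70, 10, 192),
    "pressure_mb": rng.normal(1010, 5, 192),
})
num_cols = ["temperature_c", "humidity_pct", "pressure_mb"]

# Aggregate features: hourly readings -> daily means per station.
hourly["date"] = hourly["timestamp"].dt.floor("D")
daily = hourly.groupby(["station", "date"], as_index=False)[num_cols].mean()

# Encode categorical features: one-hot, dropping one level to avoid
# perfectly collinear dummy columns in a linear model.
encoded = pd.get_dummies(daily, columns=["station"], drop_first=True)

# Feature scaling: z-score standardisation of the numeric columns.
scaled = (encoded[num_cols] - encoded[num_cols].mean()) / encoded[num_cols].std(ddof=0)


def pca(X, n_components):
    """Project X onto its first principal components using SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Xc @ Vt[:n_components].T, explained[:n_components]


components, explained_ratio = pca(scaled.to_numpy(), n_components=2)
print(components.shape, explained_ratio)
```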

Part VI

Sample and balance
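
One way to read Part VI is sketched below: draw a reproducible random sample, then downsample each group to the size of the smallest one. Balancing by a "region" column is an assumption made for illustration, since the outline does not say what is being balanced; the data is synthetic and the helper name balance_by_group is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, deliberately unbalanced data; column names are assumptions.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "region": ["N"] * 50 + ["SE"] * 300 + ["S"] * 150,
    "temperature_c": rng.normal(25, 3, 500),
})

# Sample: a reproducible 20% random sample of the rows.
sample = df.sample(frac=0.2, random_state=42)


def balance_by_group(df, group_col, random_state=42):
    """Downsample every group to the size of the smallest group."""
    n_min = df[group_col].value_counts().min()
    return df.groupby(group_col).sample(n=n_min, random_state=random_state)


balanced = balance_by_group(df, "region")
print(len(sample))
print(balanced["region"].value_counts())
```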