
Data Cleanser


Before triggering Azure AutoML, our proposed framework (Auto Tune Model) improves the quality of the input dataset using the Data Cleansing component. Since data is the currency of any machine learning model, it is critical to the success of machine learning applications: the algorithms we use can be powerful, but without relevant, correct training data the system may fail to yield ideal results. Data cleansing refers to the tasks and activities that identify and repair errors in a dataset that may negatively impact a predictive model. This improves the quality of the training data and enables accurate decision-making. The function `autodatacleaner` encompasses all the underlying features of the data cleansing component, outlined below.

  1. Handle Missing Values:
    Data can have missing values for several reasons, such as observations that were not recorded and data corruption. Handling missing data is important because many machine learning algorithms do not support data with missing values. Missing values can be handled by creating an explicit "Missing" category for categorical data, by flagging and filling with 0 for numerical data, or by imputation. As part of the Data Cleansing component, we either impute the missing values or drop the column entirely, decided by a threshold of 50%. First, we replace cells that are empty or contain only white space with NaN (white space inside a value is preserved). If more than half of the values in a column are NaN, we drop the column; otherwise we impute the missing values with the median for numerical columns and the mode for categorical columns. One limitation of dropping columns is that we lose information that may help us draw better conclusions about the study, including any insight carried by the fact that a particular value is missing. This can be handled by applying feature importance to identify the significant columns that are useful for the predictive model and should not be dropped, treating them as exceptions. A minimal sketch of this logic follows below.
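    The following is a minimal pandas sketch of the threshold-and-impute logic described above; the function name `handle_missing_values` is illustrative and not the actual `autodatacleaner` implementation.

    ```python
    import numpy as np
    import pandas as pd

    def handle_missing_values(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
        # Replace cells that are empty or contain only white space with NaN
        df = df.replace(r"^\s*$", np.nan, regex=True)
        for col in list(df.columns):
            if df[col].isna().mean() > threshold:
                # More than half the column is missing: drop it
                df = df.drop(columns=[col])
            elif pd.api.types.is_numeric_dtype(df[col]):
                # Numerical column: impute with the median
                df[col] = df[col].fillna(df[col].median())
            else:
                # Categorical column: impute with the mode (most frequent value)
                df[col] = df[col].fillna(df[col].mode()[0])
        return df
    ```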

  1. Fix Structural Errors:
    After handling missing values, the next step is to make sure the remaining observations are well-structured. Structural errors may occur during data transfer due to small human mistakes during data entry. Things to look out for when fixing data structure include typographical errors, grammatical blunders, and so on; these mostly concern categorical data. We fix these structural errors by removing leading/trailing white space and resolving inconsistent capitalization in categorical columns, as sketched below.
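    A minimal pandas sketch of these fixes, assuming string-typed categorical columns; the function name `fix_structural_errors` is illustrative.

    ```python
    import pandas as pd

    def fix_structural_errors(df: pd.DataFrame) -> pd.DataFrame:
        # Apply the fixes only to object (string/categorical) columns
        for col in df.select_dtypes(include="object").columns:
            # Remove leading/trailing white space
            df[col] = df[col].str.strip()
            # Resolve inconsistent capitalization, e.g. "Male" vs "male"
            df[col] = df[col].str.lower()
        return df
    ```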

  1. Encoding of Categorical Columns:
    In machine learning, we usually deal with datasets that contain multiple labels in one or more columns. These labels can be words or numbers; training data is often labeled in words to keep it human-readable. Label encoding converts the labels into numeric, machine-readable form, so that machine learning algorithms can better decide how those labels should be used. For label encoding we use the `LabelEncoder` class from scikit-learn's `sklearn.preprocessing` package (`from sklearn.preprocessing import LabelEncoder`), which encodes target labels with values between 0 and n_classes − 1. A short usage example follows below.
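    A short usage example of `LabelEncoder`; the sample labels are illustrative.

    ```python
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    labels = ["red", "green", "blue", "green"]
    # fit_transform maps each distinct label to an integer in [0, n_classes - 1],
    # assigned in sorted (alphabetical) order of the classes
    encoded = encoder.fit_transform(labels)
    print(encoded)                              # [2 1 0 1]
    print(encoder.inverse_transform(encoded))   # ['red' 'green' 'blue' 'green']
    ```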

  1. Normalization:
    Most datasets have multiple features spanning varying magnitudes, ranges, and units. This can bias an ML model towards the features with the dominant scale, which is an obstacle for machine learning algorithms that are highly sensitive to feature scale. We tackle this problem using normalization. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information. We normalize our dataset using the `MinMaxScaler` class from scikit-learn's `sklearn.preprocessing` package (`from sklearn.preprocessing import MinMaxScaler`). MinMaxScaler transforms features by scaling each feature to a given range on the training set, e.g., between zero and one: it shifts and rescales the values so that they end up ranging between 0 and 1. A usage example follows after the formula below.
    Here’s the formula for normalization:
    X' = (X − X_min) / (X_max − X_min)
    Here, X_max and X_min are the maximum and minimum values of the feature, respectively.
    - When X is the minimum value in the column, the numerator is 0, so X' is 0.
    - When X is the maximum value in the column, the numerator equals the denominator, so X' is 1.
    - When X is between the minimum and maximum values, X' is between 0 and 1.
    The transformation is given by:

    ```python
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min
    ```

    where `min, max = feature_range`.
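    A short usage example of `MinMaxScaler` with the default `feature_range=(0, 1)`; the sample data is illustrative.

    ```python
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    data = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

    # Each column is rescaled independently to [0, 1]
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(data)
    print(scaled)
    # [[0.  0. ]
    #  [0.5 0.5]
    #  [1.  1. ]]
    ```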