
Exercise 3: Data Cleaning and Preprocessing

This exercise covers identifying and handling missing values in the Titanic dataset using methods such as dropna() and fillna(). We will also convert data types where necessary.

Step 1: Identify Missing Values

  1. Check for Missing Values:

    • After loading the dataset into a pandas DataFrame, we can use the isnull() and sum() methods to check for missing values.
    df.isnull().sum()
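As a quick illustration, the sketch below runs the same check on a small made-up DataFrame (the values are illustrative and not taken from train.csv). It also computes the share of missing values per column, which can help when deciding between dropping and filling.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic DataFrame (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 0],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()

# Fraction of missing values in each column (mean of the boolean mask).
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

On the real dataset, the same two calls work unchanged once df has been loaded with pd.read_csv().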

Step 2: Handle Missing Values

  1. Decide on a Strategy:

    • We can either remove rows/columns with missing values using dropna() or fill missing values using fillna().
  2. Remove Missing Values:

    • To remove rows with any missing values, use dropna().
    df_dropped = df.dropna()
    • To remove columns with any missing values, use:
    df_dropped_cols = df.dropna(axis=1)
  3. Fill Missing Values:

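    • To fill missing values, we can use fillna(). For example, we can fill missing values in the Age column with the mean age. Assigning the result back to the column avoids calling fillna() with inplace=True on a column selection, which recent pandas versions warn about and which may leave df unchanged:
    df['Age'] = df['Age'].fillna(df['Age'].mean())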
    • We can fill missing values in the Embarked column with the most frequent value (the mode), again assigning the result back to the column:
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
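The options in this step can be compared side by side. The following is a minimal sketch on a made-up DataFrame (the column values are assumptions for illustration, not real Titanic records); it shows how much data each dropna() variant discards and what the filled result looks like.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic DataFrame (values are made up).
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Option A: drop every row that has at least one missing value.
rows_dropped = df.dropna()

# Option B: drop every column that has at least one missing value.
cols_dropped = df.dropna(axis=1)

# Option C: keep all rows and columns, and fill the gaps instead.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(rows_dropped.shape)             # only the two complete rows remain
print(cols_dropped.columns.tolist())  # only the complete column remains
print(filled)
```

Dropping is simple but can discard a lot of data when gaps are spread across many rows or columns, so filling is often the better choice when only a few values are missing.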

Step 3: Convert Data Types if Necessary

  1. Check Data Types:

    • Use the dtypes attribute to check the data type of each column.
    df.dtypes
  2. Convert Data Types:

    • If needed, we can convert data types using the astype() method. For example, converting the Survived column to bool:
    df['Survived'] = df['Survived'].astype(bool)
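The conversion can be checked on a small example. The sketch below uses made-up values and also illustrates one caveat worth knowing: astype(bool) treats NaN as True, so missing values should be handled before converting a column to bool.

```python
import numpy as np
import pandas as pd

# Toy Survived column (values are made up).
df = pd.DataFrame({"Survived": [0, 1, 1, 0]})

before = df["Survived"].dtype                  # an integer dtype
df["Survived"] = df["Survived"].astype(bool)   # 0 -> False, 1 -> True
after = df["Survived"].dtype                   # bool

print(before, "->", after)

# Caveat: NaN is converted to True, not to False or to a missing value.
with_gap = pd.Series([0.0, 1.0, np.nan])
as_bool = with_gap.astype(bool)
print(as_bool.tolist())
```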

Step-by-Step Execution

  1. Check for Missing Values:

    import pandas as pd
    df = pd.read_csv('train.csv')
    missing_values = df.isnull().sum()
    print(missing_values)
  2. Handle Missing Values:

    • Fill Missing Values:

      # Assign the result back instead of using inplace=True on a column selection
      df['Age'] = df['Age'].fillna(df['Age'].mean())
      df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    • Drop Rows/Columns with Missing Values (if needed):

      df_dropped = df.dropna()  # To drop rows with any missing values
      # df_dropped_cols = df.dropna(axis=1)  # To drop columns with any missing values
  3. Convert Data Types if Necessary:

    print(df.dtypes)
    df['Survived'] = df['Survived'].astype(bool)
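The steps above can also be wrapped in a single function so that the result is easy to verify. This is only a sketch: the function name clean_titanic is hypothetical, and the toy DataFrame stands in for train.csv so that the example runs without the file.

```python
import numpy as np
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input DataFrame is left unchanged."""
    out = df.copy()
    # Fill numeric gaps with the mean and categorical gaps with the mode.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Convert the 0/1 flag to bool.
    out["Survived"] = out["Survived"].astype(bool)
    return out


# Toy data with the same column names as the Titanic dataset (values are made up).
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", None, "S", "C"],
})

cleaned = clean_titanic(raw)

# Verify the result: no gaps remain in the filled columns.
print(cleaned[["Age", "Embarked"]].isnull().sum())
print(cleaned.dtypes)
```

With the real data, the call would be clean_titanic(pd.read_csv('train.csv')). Working on a copy keeps the raw DataFrame available for comparison.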

By following these steps, you will have identified and handled missing values in the Titanic dataset and converted data types if necessary.

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
