Exercise 3: Data Cleaning and Preprocessing

This exercise covers identifying and handling missing values in the Titanic dataset with pandas methods such as dropna() and fillna(). We will also convert data types where necessary.

Step 1: Identify Missing Values

Check for Missing Values:
- After loading the dataset into a pandas DataFrame, we can use the isnull() and sum() methods to check for missing values.
```
df.isnull().sum()
```
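Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative, not taken from the Titanic files) that reports both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Small made-up sample that mimics a few Titanic columns (not the real data).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 0],
})

# Number of missing values in each column.
missing_count = sample.isnull().sum()

# Share of missing values in each column, as a percentage.
# The mean of a boolean column is the fraction of True values.
missing_pct = sample.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

A column with a high percentage of missing values is usually a candidate for dropping, while one with only a few gaps is usually a candidate for filling.
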

Step 2: Handle Missing Values

Decide on a Strategy:
- We can either remove rows/columns with missing values using dropna() or fill missing values using fillna().

Remove Missing Values:
- To remove rows with any missing values, use dropna().
```
df_dropped = df.dropna()
```
- To remove columns with any missing values, use:
```
df_dropped_cols = df.dropna(axis=1)
```
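Calling dropna() with no arguments can discard far more data than intended when one column is mostly empty, as Cabin is in the Titanic training data. A hedged sketch on a made-up sample shows two narrower options that dropna() supports: `subset` limits the check to chosen columns, and `thresh` sets the minimum number of non-missing values required to keep a row or column.

```python
import numpy as np
import pandas as pd

# Made-up sample; Cabin is mostly missing, as an illustration.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# A plain dropna() removes every row that has at least one gap.
all_dropped = sample.dropna()

# Drop rows only when Embarked is missing; gaps elsewhere are kept.
rows_kept = sample.dropna(subset=["Embarked"])

# Keep only the columns that have at least 3 non-missing values.
cols_kept = sample.dropna(axis=1, thresh=3)

print(len(all_dropped), len(rows_kept), list(cols_kept.columns))
```

In this sample the plain call leaves no rows at all, while the `subset` version keeps three of the four.
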
Fill Missing Values:
- To fill missing values, we can use fillna(). For example, we can fill missing values in the Age column with the mean age:
```
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
- We can fill missing values in the Embarked column with the most frequent value:
```
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```
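Several columns can also be filled in one call by passing fillna() a dictionary that maps column names to fill values. The sketch below uses a made-up sample and swaps in the median for Age, which is an assumption made for illustration (the median is less sensitive to outliers than the mean), not the choice made in this exercise. The call returns a new DataFrame, so it does not rely on in-place updates through a column selection, a pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap in each column.
sample = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 70.0],
    "Embarked": ["S", "C", None, "S"],
})

# One fill value per column: median for the numeric column,
# most frequent value for the categorical column.
fill_values = {
    "Age": sample["Age"].median(),
    "Embarked": sample["Embarked"].mode()[0],
}

# fillna returns a new DataFrame; `sample` itself is not modified.
filled = sample.fillna(fill_values)
print(filled)
```
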

Step 3: Convert Data Types if Necessary

Check Data Types:
- Use dtypes to check the data types of each column.
```
df.dtypes
```
Convert Data Types:
- If needed, we can convert data types using the astype() method. For example, converting the Survived column to bool:
```
df['Survived'] = df['Survived'].astype(bool)
```
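astype() also accepts a dictionary, which converts several columns at once. As a sketch on a made-up sample, the example below converts Survived to bool and Sex to the pandas `category` dtype; treating Sex as categorical is an assumption for illustration and is not part of the original exercise.

```python
import pandas as pd

# Made-up sample with three Titanic-style columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 2, 3],
})

# Map each column to its target dtype; unlisted columns keep their dtype.
converted = sample.astype({"Survived": bool, "Sex": "category"})
print(converted.dtypes)
```

Checking `dtypes` again after the conversion confirms that only the listed columns changed.
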

Step-by-Step Execution

Check for Missing Values:

```
import pandas as pd
df = pd.read_csv('train.csv')
missing_values = df.isnull().sum()
print(missing_values)
```

Handle Missing Values:

Fill Missing Values:

```
# Assign the result back to the column instead of using inplace=True on a selection
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```

Drop Rows/Columns with Missing Values (if needed):

```
df_dropped = df.dropna()  # To drop rows with any missing values
# df_dropped_cols = df.dropna(axis=1)  # To drop columns with any missing values
```

Convert Data Types if Necessary:

```
print(df.dtypes)
df['Survived'] = df['Survived'].astype(bool)
```

By following these steps, you will have identified and handled missing values in the Titanic dataset and converted data types if necessary.
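To confirm that the cleaning worked, the steps can be wrapped in a function and the result checked for remaining gaps. This is a minimal sketch: `clean_titanic` is a hypothetical helper name, and the sample data is made up so the block runs without `train.csv`.

```python
import numpy as np
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a cleaned copy, leaving the input unchanged."""
    cleaned = df.copy()
    # Fill numeric gaps with the column mean.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
    # Fill categorical gaps with the most frequent value.
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    # Store the 0/1 survival flag as a boolean.
    cleaned["Survived"] = cleaned["Survived"].astype(bool)
    return cleaned


# Made-up sample standing in for the real dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", None, "C", "S"],
})

cleaned = clean_titanic(sample)

# Verify that the filled columns no longer contain missing values.
remaining = cleaned[["Age", "Embarked"]].isnull().sum()
print(remaining)
```
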

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
