This section covers the process of identifying and handling missing values in the Titanic dataset using methods such as `dropna()` and `fillna()`. We will also convert data types as necessary.
- Check for Missing Values:
  - After loading the dataset into a pandas DataFrame, we can chain the `isnull()` and `sum()` methods to count the missing values in each column.

        df.isnull().sum()

- Decide on a Strategy:
  - We can either remove rows/columns with missing values using `dropna()` or fill missing values using `fillna()`.
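
One way to make this decision concrete is to look at the share of missing values in each column. The sketch below uses a small made-up DataFrame in place of the Titanic data. The helper name `suggest_strategy` and the 50% cutoff are illustrative assumptions, not a rule from pandas or from this guide.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Survived": [0, 1, 1, 1],
})


def suggest_strategy(frame, drop_threshold=0.5):
    """Hypothetical helper: suggest 'drop' or 'fill' for each column with gaps."""
    missing_share = frame.isnull().mean()  # fraction of missing values per column
    suggestions = {}
    for column, share in missing_share.items():
        if share == 0:
            continue  # column is complete, nothing to handle
        # Assumed rule of thumb: mostly-empty columns are dropped, others filled.
        suggestions[column] = "drop" if share > drop_threshold else "fill"
    return suggestions


strategy = suggest_strategy(toy)
print(strategy)
```

The cutoff is a judgment call; a column with few gaps is usually worth filling, while a column that is mostly empty may carry too little information to keep.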
- Remove Missing Values:
  - To remove rows with any missing values, use `dropna()`.

        df_dropped = df.dropna()

  - To remove columns with any missing values, use:

        df_dropped_cols = df.dropna(axis=1)

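
Dropping every row that has any gap can discard far more data than intended when one column is mostly empty. As a hedged illustration on made-up data, the sketch below compares plain `dropna()` with two narrower options that go beyond the text above: the `subset` and `thresh` parameters of `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Drops every row with at least one missing value.
rows_any = toy.dropna()

# Only drops rows where Age is missing; gaps in other columns are ignored.
rows_age = toy.dropna(subset=["Age"])

# Keeps only columns that have at least 3 non-missing values.
cols_mostly_full = toy.dropna(axis=1, thresh=3)

print(len(rows_any), len(rows_age), list(cols_mostly_full.columns))
```

In this toy frame every row has a gap somewhere, so plain `dropna()` leaves no rows at all, while the narrower calls keep most of the data.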
- Fill Missing Values:
  - To fill missing values, we can use `fillna()`. For example, we can fill missing values in the `Age` column with the mean age, assigning the result back to the column:

        # Assign back rather than calling fillna(..., inplace=True) on a selected
        # column, which may not update df in recent pandas versions.
        df['Age'] = df['Age'].fillna(df['Age'].mean())

  - We can fill missing values in the `Embarked` column with the most frequent value:

        df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

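
As an alternative sketch, `DataFrame.fillna()` also accepts a dictionary that maps column names to fill values, so both columns can be filled in one call. The data below is made up for illustration, and the names `toy`, `fill_values`, and `filled` are not from the original text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# One fill value per column: mean for Age, most frequent value for Embarked.
fill_values = {
    "Age": toy["Age"].mean(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna returns a new DataFrame; the original toy frame is left unchanged.
filled = toy.fillna(fill_values)
print(filled)
```

Because the result is a new DataFrame, the original stays available for comparison, which can be handy when checking how much the fill changed the data.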
- Check Data Types:
  - Use `dtypes` to check the data type of each column.

        df.dtypes

- Convert Data Types:
  - If needed, we can convert data types using the `astype()` method. For example, converting the `Survived` column to `bool`:

        df['Survived'] = df['Survived'].astype(bool)
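
The order of the steps matters here: converting a column to `bool` while it still contains missing values turns each gap into `True` without any warning. The sketch below shows this on a made-up Series (not the real `Survived` column), together with pandas' nullable `"boolean"` dtype as one possible way to keep the gaps visible.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap, standing in for a 0/1 column.
flags = pd.Series([1.0, np.nan, 0.0], name="Survived")

# Plain bool has no way to represent a missing value, so NaN becomes True.
as_plain_bool = flags.astype(bool)

# The nullable "boolean" dtype keeps the gap as <NA>.
as_nullable_bool = flags.astype("boolean")

print(as_plain_bool.tolist())
print(as_nullable_bool.isna().sum())
```

A quick `df['Survived'].isnull().any()` check before converting is a cheap way to confirm that the column has no gaps left.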
- Check for Missing Values:

      import pandas as pd

      df = pd.read_csv('train.csv')
      missing_values = df.isnull().sum()
      print(missing_values)

- Handle Missing Values:
  - Fill Missing Values:

        # Assign the filled column back instead of using inplace=True on a selection.
        df['Age'] = df['Age'].fillna(df['Age'].mean())
        df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

  - Drop Rows/Columns with Missing Values (if needed):

        df_dropped = df.dropna()  # To drop rows with any missing values
        # df_dropped_cols = df.dropna(axis=1)  # To drop columns with any missing values

- Convert Data Types if Necessary:

      print(df.dtypes)
      df['Survived'] = df['Survived'].astype(bool)
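
For reuse, the same steps can be wrapped in a single function. This is only a sketch under stated assumptions: the function name `clean_titanic` is hypothetical, the rows are made up rather than read from `train.csv`, and the function works on a copy so the input frame is not modified.

```python
import numpy as np
import pandas as pd


def clean_titanic(frame):
    """Hypothetical helper: return a cleaned copy, leaving the input untouched."""
    cleaned = frame.copy()
    # Fill numeric gaps with the mean and categorical gaps with the mode.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    # Convert the 0/1 target column to bool.
    cleaned["Survived"] = cleaned["Survived"].astype(bool)
    return cleaned


# Made-up rows standing in for train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", np.nan, "C", "S"],
})

cleaned = clean_titanic(toy)
print(cleaned.isnull().sum())
print(cleaned.dtypes)
```

With the real data, the call would be `clean_titanic(pd.read_csv('train.csv'))`, assuming the file has the `Age`, `Embarked`, and `Survived` columns used above.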
By following these steps, you will have identified and handled missing values in the Titanic dataset and converted data types if necessary.
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn with any other queries or feedback.