This repository contains a Jupyter notebook documenting the steps taken to create the "In The Wild" dataset, now available on Kaggle. The dataset contains real and fake audio clips for deepfake detection.
The dataset was originally sourced from Deepfake Total. Due to format issues with the downloaded file, additional steps were required to prepare the dataset for upload to Kaggle.
The Jupyter notebook includes:
- Downloading the dataset
- Extracting the files
- Organizing the files into `real` and `fake` folders
- Uploading the dataset to Kaggle
Clone this repository and open the Jupyter notebook to follow the steps taken.
git clone https://github.com/yourusername/In-The-Wild-Dataset-Process.git
cd In-The-Wild-Dataset-Process
jupyter notebook In_The_Wild_Dataset_Process.ipynb
- The dataset was downloaded from Deepfake Total.
- The downloaded file had no extension, which caused issues during extraction.
# Extract the extension-less download (it is a ZIP archive) and list the contents
!unzip /kaggle/input/in-the-wild-dataset/download -d /kaggle/working
!ls /kaggle/working/
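Because the download has no extension, it can help to confirm that it really is a ZIP archive before extracting. The short check below is a hypothetical addition using Python's standard `zipfile` module and the same path as above.

import zipfile

archive_path = '/kaggle/input/in-the-wild-dataset/download'

# zipfile.is_zipfile inspects the file's magic bytes, so the missing extension doesn't matter
if zipfile.is_zipfile(archive_path):
    print("ZIP archive detected; safe to extract with unzip or zipfile")
else:
    with open(archive_path, 'rb') as f:
        print("Unrecognized format, first bytes:", f.read(4))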
- After extraction, we separated the files into real and fake folders based on their content.
import os
import shutil
import pandas as pd

def process_files(folder_path, csv_path):
    # Sort files in the folder
    files = sorted(os.listdir(folder_path))
    # Read the metadata CSV into a DataFrame
    df = pd.read_csv(csv_path)
    # Iterate over sorted files
    for filename in files:
        if filename.endswith('.wav'):
            # Look up the label ("real" or "fake") for this file in the metadata
            label = df.loc[df['file'] == filename, 'label'].values[0]
            # Create a folder for the label if it doesn't exist
            label_folder = os.path.join(folder_path, label)
            os.makedirs(label_folder, exist_ok=True)
            # Move the file into its label folder
            source_file = os.path.join(folder_path, filename)
            destination_file = os.path.join(label_folder, filename)
            shutil.move(source_file, destination_file)
            print(f"Moved {filename} to {label_folder}")
    print("Data split completed successfully!")

folder_path = '/kaggle/working/release_in_the_wild'
csv_path = '/kaggle/working/modified_meta.csv'
process_files(folder_path, csv_path)
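As a quick sanity check after the split, the number of files that ended up in each label folder can be counted. This is a small hypothetical snippet reusing the `folder_path` defined above.

for label in ('real', 'fake'):
    label_folder = os.path.join(folder_path, label)
    # Count the .wav files moved into this label's folder
    count = sum(1 for f in os.listdir(label_folder) if f.endswith('.wav')) if os.path.isdir(label_folder) else 0
    print(f"{label}: {count} files")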
- Finally, the organized dataset was uploaded to Kaggle. You can find the dataset and notebook links above.
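For reference, one way to publish the organized folder is with the Kaggle CLI, as sketched below. This is a hypothetical approach, assuming Kaggle API credentials are already configured; the upload may also be done through the Kaggle website.

# Generate a dataset-metadata.json template, then edit its title and id fields
!kaggle datasets init -p /kaggle/working/release_in_the_wild
# Upload the folder as a new dataset, zipping subdirectories before upload
!kaggle datasets create -p /kaggle/working/release_in_the_wild --dir-mode zip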
- Abdalla Mohammed
- Mohammed Abdeldayem
- We welcome contributions! Please feel free to submit pull requests or open issues for any suggestions or improvements.
- This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.