
In-The-Wild Dataset Process

This repository contains a Jupyter notebook documenting the steps taken to create the "In The Wild" dataset, now available on Kaggle. The dataset contains real and fake audio recordings for deepfake detection.

Overview

The dataset was originally sourced from Deepfake Total. Due to format issues with the downloaded file, additional steps were required to prepare the dataset for upload to Kaggle.

The Jupyter notebook includes:

  1. Downloading the dataset
  2. Extracting the files
  3. Organizing the files into real and fake folders
  4. Uploading the dataset to Kaggle

Usage

Clone this repository and open the Jupyter notebook to follow the steps taken.

git clone https://github.com/Abdalla312/In-The-Wild-Dataset-Process.git
cd In-The-Wild-Dataset-Process
jupyter notebook In_The_Wild_Dataset_Process.ipynb

Kaggle Links

Steps Followed

1. Downloading the Dataset
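
  • The archive is fetched from Deepfake Total. A minimal sketch of this step is below, assuming a placeholder URL (the real download link is obtained from the Deepfake Total site); note that the server delivers the file without an extension, which matters in the next step.
import requests

# Placeholder URL -- the actual link comes from the Deepfake Total site.
DATASET_URL = "https://example.com/in-the-wild/download"

# Stream the archive to disk. The served file carries no extension,
# so it is saved simply as "download".
with requests.get(DATASET_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("download", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)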

2. Extracting the Files

  • The downloaded file had no extension, which initially caused extraction issues. Since the file is in fact a standard zip archive, unzip can read it directly despite the missing extension:
!unzip /kaggle/input/in-the-wild-dataset/download -d /kaggle/working
!ls /kaggle/working/
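
Because the file arrives with no extension, a quick way to confirm it really is a zip archive before extracting is Python's zipfile module. This check is a hedged addition, not part of the original notebook:

import zipfile

# Verify that the extensionless download is a zip archive before calling unzip.
if zipfile.is_zipfile('/kaggle/input/in-the-wild-dataset/download'):
    print("The file is a zip archive and can be extracted with unzip.")
else:
    print("Not a zip archive; inspect the file type another way.")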

3. Organizing the Files

  • After extraction, the files were separated into real and fake folders based on the labels in the metadata CSV.
import os
import shutil

import pandas as pd

def process_files(folder_path, csv_path):
    # Sort files in the folder for a deterministic processing order
    files = sorted(os.listdir(folder_path))

    # Read the metadata CSV (columns: 'file', 'label') into a DataFrame
    df = pd.read_csv(csv_path)

    # Iterate over the sorted files
    for filename in files:
        if filename.endswith('.wav'):
            # Look up the label (real or fake) by full filename
            label = df.loc[df['file'] == filename, 'label'].values[0]

            # Create a folder for the label if it doesn't exist
            label_folder = os.path.join(folder_path, label)
            os.makedirs(label_folder, exist_ok=True)

            # Move the file into its label folder
            source_file = os.path.join(folder_path, filename)
            destination_file = os.path.join(label_folder, filename)
            shutil.move(source_file, destination_file)
            print(f"Moved {filename} to {label_folder}")
    print("Data split completed successfully!")

folder_path = '/kaggle/working/release_in_the_wild'
csv_path = '/kaggle/working/modified_meta.csv'
process_files(folder_path, csv_path)
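
process_files assumes that modified_meta.csv maps each .wav filename to its label. That file is produced earlier in the notebook; a quick sanity check of its assumed shape ('file' and 'label' columns, as used in the lookup above) can be run first:

import pandas as pd

# Inspect the metadata before moving files around.
df = pd.read_csv('/kaggle/working/modified_meta.csv')
print(df.head())                    # expected columns: 'file', 'label'
print(df['label'].value_counts())   # distribution of real vs. fake clips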

4. Creating the Kaggle Dataset from the Notebook Outputs

  • Finally, the organized dataset was uploaded to Kaggle. You can find the dataset and notebook links above.
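
From inside a Kaggle notebook, this upload is typically done with the official Kaggle API. A minimal sketch, assuming the organized files live in /kaggle/working/release_in_the_wild (the title and slug in the generated metadata file are placeholders to be edited by hand):

# Generate a dataset-metadata.json template in the output folder,
# then edit its "title" and "id" fields before publishing.
!kaggle datasets init -p /kaggle/working/release_in_the_wild
# Create the dataset, zipping subdirectories so the real/fake folders are preserved.
!kaggle datasets create -p /kaggle/working/release_in_the_wild --dir-mode zip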

Contributors

License

  • This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
