
In-The-Wild Dataset Process

This repository contains a Jupyter notebook documenting the steps taken to create the "In The Wild" dataset, now available on Kaggle. The dataset contains real and fake audio recordings for deepfake detection.

Overview

The dataset was originally sourced from Deepfake Total. Due to format issues with the downloaded file, additional steps were required to prepare the dataset for upload to Kaggle.

The Jupyter notebook includes:

  1. Downloading the dataset
  2. Extracting the files
  3. Organizing the files into real and fake folders
  4. Uploading the dataset to Kaggle

Usage

Clone this repository and open the Jupyter notebook to follow the steps taken.

git clone https://github.com/Abdalla312/In-The-Wild-Dataset-Process.git
cd In-The-Wild-Dataset-Process
jupyter notebook In_The_Wild_Dataset_Process.ipynb

Kaggle Links

Steps Followed

1. Downloading the Dataset
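
  • The archive is fetched from Deepfake Total. A minimal sketch of this step is below, assuming a placeholder URL (the real download link is obtained from the Deepfake Total site); note that the server delivers the file without an extension, which matters in the next step.
import requests

# Placeholder URL -- the actual link comes from the Deepfake Total site.
DATASET_URL = "https://example.com/in-the-wild/download"

# Stream the archive to disk. The served file carries no extension,
# so it is saved simply as "download".
with requests.get(DATASET_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("download", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)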

2. Extracting the Files

  • The downloaded file had no extension, which initially caused extraction issues. Since the file is in fact a standard zip archive, unzip can read it directly despite the missing extension:
!unzip /kaggle/input/in-the-wild-dataset/download -d /kaggle/working
!ls /kaggle/working/
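
Because the file arrives with no extension, a quick way to confirm it really is a zip archive before extracting is Python's zipfile module. This check is a hedged addition, not part of the original notebook:

import zipfile

# Verify that the extensionless download is a zip archive before calling unzip.
if zipfile.is_zipfile('/kaggle/input/in-the-wild-dataset/download'):
    print("The file is a zip archive and can be extracted with unzip.")
else:
    print("Not a zip archive; inspect the file type another way.")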

3. Organizing the Files

  • After extraction, the files were separated into real and fake folders based on the labels in the metadata CSV.
import os
import shutil

import pandas as pd

def process_files(folder_path, csv_path):
    # Sort files in the folder for a deterministic processing order
    files = sorted(os.listdir(folder_path))

    # Read the metadata CSV (columns: 'file', 'label') into a DataFrame
    df = pd.read_csv(csv_path)

    # Iterate over the sorted files
    for filename in files:
        if filename.endswith('.wav'):
            # Look up the label (real or fake) by full filename
            label = df.loc[df['file'] == filename, 'label'].values[0]

            # Create a folder for the label if it doesn't exist
            label_folder = os.path.join(folder_path, label)
            os.makedirs(label_folder, exist_ok=True)

            # Move the file into its label folder
            source_file = os.path.join(folder_path, filename)
            destination_file = os.path.join(label_folder, filename)
            shutil.move(source_file, destination_file)
            print(f"Moved {filename} to {label_folder}")
    print("Data split completed successfully!")

folder_path = '/kaggle/working/release_in_the_wild'
csv_path = '/kaggle/working/modified_meta.csv'
process_files(folder_path, csv_path)
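
process_files assumes that modified_meta.csv maps each .wav filename to its label. That file is produced earlier in the notebook; a quick sanity check of its assumed shape ('file' and 'label' columns, as used in the lookup above) can be run first:

import pandas as pd

# Inspect the metadata before moving files around.
df = pd.read_csv('/kaggle/working/modified_meta.csv')
print(df.head())                    # expected columns: 'file', 'label'
print(df['label'].value_counts())   # distribution of real vs. fake clips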

4. Creating the Kaggle Dataset from the Notebook Outputs

  • Finally, the organized dataset was uploaded to Kaggle. You can find the dataset and notebook links above.
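
From inside a Kaggle notebook, this upload is typically done with the official Kaggle API. A minimal sketch, assuming the organized files live in /kaggle/working/release_in_the_wild (the title and slug in the generated metadata file are placeholders to be edited by hand):

# Generate a dataset-metadata.json template in the output folder,
# then edit its "title" and "id" fields before publishing.
!kaggle datasets init -p /kaggle/working/release_in_the_wild
# Create the dataset, zipping subdirectories so the real/fake folders are preserved.
!kaggle datasets create -p /kaggle/working/release_in_the_wild --dir-mode zip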

Contributors

License

  • This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
