
🚨 Classifying disaster-related tweets using deep learning 🤖 to identify real vs. fake news during crises 🌍. 🔍 NLP techniques help clean and preprocess data for accurate predictions 📊.


sergio11/disasters_prediction


🌍 Advanced Classification of Disaster-Related Tweets Using Deep Learning 🚨

In this project, I tackle the challenge of classifying tweets to determine whether they’re related to disasters or not. Using deep learning techniques, the model sifts through tweet data and helps us understand how social media reacts to crises in real time. Inspired by the "NLP with Disaster Tweets" challenge, this project is enhanced with additional data to give deeper insights into disaster-related topics.

🙏 I would like to extend my heartfelt gratitude to Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence. His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.

⚠️ Disclaimer

This project was developed for educational and research purposes only. It is an academic exploration of deep learning techniques for classifying disaster-related tweets.

The models and analyses presented in this repository are not intended for real-world emergency response or crisis management. They serve as a proof of concept and have not been rigorously validated for reliability, accuracy, or bias in diverse social media environments.

While this project leverages publicly available datasets and references existing research, users should not rely on its outputs for making emergency decisions or disseminating crisis-related information. Always verify critical information from official sources.


📊 Dataset Overview

🗺️ The Dataset

This dataset includes over 11,000 tweets focused on major disasters, like the COVID-19 outbreak, Taal Volcano eruption, and the bushfires in Australia. It’s a snapshot of how people react and communicate during global crises.

The data includes:

  • Tweets: The text content of the tweet 📱
  • Keywords: Disaster-related keywords like “earthquake” or “flood” 🌪️
  • Location: Geographical information when available 🌍

Collected on January 14th, 2020, it represents critical moments in recent history, including:

  • The Taal Volcano eruption (Philippines 🌋)
  • COVID-19 (global pandemic 🦠)
  • Bushfires in Australia (Australia 🔥)
  • The Iranian downing of flight PS752 (international tragedy ✈️)

⚠️ Caution

This dataset contains tweets that may include offensive language 😬. Please proceed with caution during analysis.

🎯 Project Goals

💡 Why I Am Doing This

The goal of this project is clear: build a deep learning model that can classify tweets as related to disasters or not. Here's how we're approaching it:

  1. Enriching the dataset: By adding manually classified tweets, we can boost the quality and size of our dataset 📈.
  2. Building a robust model: Using deep learning and NLP techniques to extract meaningful features from the data 🔍.
  3. Classifying tweets: The model will distinguish between disaster-related and non-disaster tweets, helping us understand patterns in social media behavior during crises.

💪 Why This Matters

Why is it important to classify disaster-related tweets? Here are a few reasons:

  • Emergency Response: Helps first responders prioritize real-time, crucial information 🆘.
  • Better Resource Allocation: Directs attention to actual disasters and helps prevent the spread of misinformation 🤖.
  • Misinformation Control: Filters out false information during global crises and ensures people are getting accurate updates 📉.

🔧 Methodology

1. Data Preprocessing 🧹

Before we can train our deep learning model, we need to clean up the data. This includes:

  • Removing URLs: Twitter links won’t help us classify the content, so we remove them 🔗❌.
  • Eliminating Emojis: While fun, emojis don't add value in this classification task 😜❌.
  • Removing HTML Tags & Punctuation: Ensuring we’re working with clean text 🌐✂️.
  • Tokenizing the Text: Breaking down the tweets into individual words or tokens 🧠.
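
The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the repository's actual code: the function name `clean_tweet`, the emoji code point ranges, the lowercasing step, and the whitespace tokenization are all assumptions made for the example.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji ranges (an assumption, not an exhaustive list of emoji code points).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector that often follows an emoji
    "]+"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Return a list of lowercase tokens with URLs, HTML tags, emojis and punctuation removed."""
    # URLs and tags go first: stripping punctuation earlier would break both patterns.
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Lowercasing is an assumption; the README does not say whether case is kept.
    return text.lower().split()


tokens = clean_tweet("Flood in <b>Santa Fe</b>! 🌊 https://t.co/abc123 Stay safe.")
print(tokens)
```

The order of the substitutions matters: a URL or tag can only be matched while its punctuation is still present, so punctuation removal comes last.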

2. Model Architecture 🏗️

The deep learning model implemented for this project follows a feedforward neural network architecture designed to classify disaster-related tweets. Here's an overview of the model structure:

  1. Input Layer:
    The model takes in preprocessed tweet data as input, where each tweet is represented by a vector of features (after vectorization using TF-IDF). The shape of the input is (X_train.shape[1],) which corresponds to the number of features for each tweet.

  2. Hidden Layers:

    • The first hidden layer has 16 neurons and uses the ReLU (Rectified Linear Unit) activation function. ReLU is commonly used because it introduces non-linearity and helps mitigate the vanishing gradient problem.
    • A Dropout Layer with a rate of 0.4 follows the first hidden layer to prevent overfitting by randomly setting 40% of the input units to 0 during training.
    • The second hidden layer also consists of 16 neurons and utilizes the ReLU activation function.
    • Another Dropout Layer with a rate of 0.4 is added to further mitigate overfitting.
  3. Output Layer:
    The output layer consists of a single neuron with a sigmoid activation function, which is suitable for binary classification tasks like this one (disaster vs. non-disaster tweet). The output will be a probability score between 0 and 1, where values closer to 1 indicate a disaster tweet.

  4. Compilation:
    The model is compiled using the Adam optimizer (a popular choice for training deep learning models) and binary cross-entropy loss, which is appropriate for binary classification tasks. Additionally, accuracy and precision metrics are used to evaluate the performance of the model.
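
To make the layer stack concrete, here is a plain-Python sketch of the forward pass through the 16-16-1 network described above. It is not the author's implementation, which presumably relies on a deep learning framework. The helper names, the random weight initialization, and the use of inverted dropout (rescaling the kept units) are assumptions for illustration. Training with Adam is not shown.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable form: never calls exp() on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def dropout(values, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training:
        return values  # dropout is a no-op at prediction time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def build_params(n_features, rng):
    """Hypothetical initializer: small uniform weights, zero biases."""
    sizes = [n_features, 16, 16, 1]
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def forward(x, params, rng, training=False, rate=0.4):
    """Input -> Dense(16, ReLU) -> Dropout -> Dense(16, ReLU) -> Dropout -> Dense(1, sigmoid)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = dropout(dense(x, w1, b1, relu), rate, rng, training)
    h = dropout(dense(h, w2, b2, relu), rate, rng, training)
    return dense(h, w3, b3, sigmoid)[0]


def count_params(params):
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
params = build_params(n_features=10, rng=rng)
print(count_params(params))               # 465 weights and biases for 10 input features
print(forward([0.5] * 10, params, rng))   # a probability strictly between 0 and 1
```

With real TF-IDF input the first layer dominates the parameter count, since it holds 16 weights per vocabulary feature.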

3. Training the Model ⏳

To train the model effectively and limit overfitting, I use the following methodology:

  1. Early Stopping:
    An EarlyStopping callback is used during training to monitor the validation loss. The training process stops if the validation loss does not improve for 5 consecutive epochs (defined by the patience parameter). This helps in preventing the model from overfitting the training data by stopping the training once it starts to generalize poorly.

  2. Model Fitting:
    The model is trained for a maximum of 100 epochs with a batch size of 1024. The training process is monitored with validation data (X_val, Y_val), allowing us to track the model’s performance on unseen data during training.
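
The stopping rule described above (patience of 5, at most 100 epochs) can be illustrated without any framework. The sketch below replays a list of validation losses and reports where training would halt. The function name and the loss values are made up for the example and are not results from this project.

```python
def simulate_early_stopping(val_losses, max_epochs=100, patience=5):
    """Replay validation losses and return (epochs_run, best_epoch, best_loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0          # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch, best_loss


# Illustrative losses only: improvement stops after epoch 3.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.50]
print(simulate_early_stopping(losses))  # (8, 3, 0.55)
```

Note that the final 0.50 is never reached: training halts at epoch 8, five epochs after the best value, which is the trade-off a small patience makes.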

4. Evaluation & Insights 📊

After training, I evaluate the model’s performance on real-world examples to see how well it classifies tweets related to disasters.

  1. Prediction on Example Tweets:
    A set of example tweets is vectorized with the same vectorizer that was fitted on the training data. The model then predicts whether each tweet is related to a disaster. The predicted probability is converted to a binary label (0 or 1), where 1 indicates a disaster-related tweet and 0 indicates a non-disaster tweet.

    Example tweets:

    • "A 6.2 magnitude earthquake has struck Concepción. Coastal areas should evacuate due to tsunami risk."
    • "A red alert has been issued for wildfires in the Valparaíso region. Residents are urged to evacuate immediately and follow emergency instructions."
    • "Strong shaking felt in the area. Possible aftershocks."
    • "Due to heavy rainfall, the Paraná River has overflowed, causing floods in Santa Fe province. Donations of food and clothing are needed."
    • "Smoke in the air. Stay indoors and monitor local news."
    • "River levels rising. Avoid low-lying areas. Stay informed."

  2. Visualization of Loss and Accuracy:
    The training and validation loss, as well as accuracy, are plotted to visualize the model’s performance over time. These plots help us understand how well the model is learning during training and whether it is overfitting or generalizing well to the validation data.

    • Loss: Tracks how well the model minimizes the loss function during training.
    • Accuracy: Tracks the percentage of correct predictions the model makes on both training and validation data.
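
The prediction step in item 1 depends on two details: new tweets must be vectorized with the vocabulary learned from the training data, and the sigmoid output must be turned into a label. The sketch below shows both in plain Python. The class `TinyTfidf`, the smoothed IDF formula, the absence of vector normalization, and the 0.5 threshold are assumptions; the README does not state which TF-IDF implementation or threshold the project uses.

```python
import math
from collections import Counter


class TinyTfidf:
    """Hypothetical minimal TF-IDF vectorizer over pre-tokenized documents."""

    def fit(self, docs):
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in set(doc))
        self.vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0.0] * len(self.vocab)
            for term, count in Counter(doc).items():
                if term in self.vocab:  # words unseen during fit are ignored
                    row[self.vocab[term]] = count * self.idf[term]
            rows.append(row)
        return rows


def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 1 (disaster) or 0 (non-disaster); the threshold is an assumption."""
    return [1 if p >= threshold else 0 for p in probabilities]


train_docs = [["flood", "river"], ["river", "sunny"]]
vectorizer = TinyTfidf().fit(train_docs)

# A new tweet is transformed only, never re-fitted, so its vector has the training width.
new_rows = vectorizer.transform([["flood", "volcano", "flood"]])
print(len(new_rows[0]))              # 3, the size of the training vocabulary
print(to_labels([0.91, 0.50, 0.12])) # [1, 1, 0]
```

Fitting a fresh vectorizer on the example tweets would produce vectors whose width and column meaning no longer match the model's input layer, which is why only `transform` is called on new text.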

📉 Results

Training Progress

We track the model’s progress using training and validation loss, as well as accuracy. This helps us understand how well the model is learning and improving during the training process.
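
For readers who want to see what the tracked quantities measure, here is a small standard-library sketch of binary cross-entropy, accuracy, and precision. The function names, the clipping constant `eps`, and the sample labels are assumptions for illustration. They are not the project's code or results.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y * ln(p) + (1 - y) * ln(1 - p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Of the tweets predicted as disasters, the fraction that really are."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]             # made-up predictions with one false alarm
print(accuracy(y_true, y_pred))   # 0.75
print(precision(y_true, y_pred))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Precision is the useful companion to accuracy here because a false alarm (a harmless tweet flagged as a disaster) lowers precision even when overall accuracy still looks acceptable.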

🔮 Conclusion

I successfully built a deep learning model capable of classifying tweets as disaster-related or not. The model performs well in distinguishing between genuine disaster tweets and irrelevant content, which is crucial for emergency response and misinformation control during crises.

🌟 Future Work

I am not stopping here! There’s still a lot of potential to enhance this project:

  • More Data: The dataset can be further expanded with more labeled tweets from different events and locations 🌎.
  • Advanced Models: Experiment with other techniques like Word2Vec or BERT for even better text representations 📚.
  • Real-Time Deployment: Imagine deploying this model for real-time disaster monitoring on Twitter 🐦.


🙏 Acknowledgments

A huge thank you to Vstepanenko for providing the dataset that made this project possible! 🌟 The dataset can be found on Kaggle. Your contribution is greatly appreciated! 🙌


Please Share & Star the repository to keep me motivated.

License ⚖️

This project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. 🛠️ This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained. 📄

Please note the following limitations:

  • The software is provided "as is", without any warranties, express or implied. 🚫🛡️
  • If you distribute the software, whether in original or modified form, you must include the original copyright notice and license. 📑
  • The license allows for commercial use, but you cannot claim ownership over the software itself. 🏷️

The goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.

MIT License

Copyright (c) 2024 Dream software - Sergio Sánchez

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
