https://sugatagh.github.io/dsml/projects/natural-language-processing-with-disaster-tweets/
https://www.kaggle.com/sugataghosh/natural-language-processing-with-disaster-tweets
-
Disaster-related tweets have the potential to alert relevant authorities early on so that they can take action to reduce damage and possibly save lives.
-
In this project, we attempt to predict whether a given tweet indicates a real disaster or not.
-
A detailed exploratory data analysis on the dataset is carried out.
-
We consider a number of text normalization processes, namely conversion to lowercase, removal of extra whitespace, removal of punctuation, removal of unicode characters (including HTML tags, emojis, and URLs starting with http), substitution of acronyms, substitution of contractions, removal of stop words, spelling correction, stemming, lemmatization, discarding of non-alphabetic words, and retention of relevant parts of speech.
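A few of these steps can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the stop-word list is a toy stand-in (a full list such as NLTK's would be used in practice), and stemming, lemmatization, acronym/contraction substitution, and spelling correction are omitted.

```python
import re

# Toy stop-word list for illustration; the real pipeline presumably
# uses a full list (e.g. NLTK's) -- an assumption here.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "and", "to"}

def normalize(tweet: str) -> list[str]:
    """Apply a subset of the normalization steps: lowercasing,
    URL/HTML/unicode removal, punctuation removal, whitespace
    collapsing, and stop-word filtering."""
    text = tweet.lower()
    text = re.sub(r"http\S+", " ", text)             # URLs starting with http
    text = re.sub(r"<[^>]+>", " ", text)             # HTML tags
    text = text.encode("ascii", "ignore").decode()   # emojis / unicode
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation and digits
    tokens = text.split()                            # also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

normalize("Forest fire near La Ronge, Sask. <b>Canada</b> http://t.co/x")
# -> ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
```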
-
We implement bag of words text representation and extend the analysis to bag of bigrams as well as a mixture representation incorporating both words and bigrams.
-
Next, we implement TF-IDF text representation. Similar to the previous setup, we carry out unigram, bigram, and mixture analysis.
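The weighting behind TF-IDF can be computed by hand on a toy corpus. The smoothed IDF formula below, log((1+N)/(1+df)) + 1, is the scikit-learn default and is an assumption about the exact variant used:

```python
import math
from collections import Counter

docs = [["forest", "fire", "near", "forest"],
        ["fire", "alarm", "test"],
        ["storm", "near", "coast"]]

N = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term frequency times smoothed inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * (math.log((1 + N) / (1 + df[t])) + 1) for t in tf}

weights = tfidf(docs[0])
# "forest" occurs twice and only in one document, so it outweighs
# "near" and "fire", which each occur once and appear in two documents.
```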
-
Finally, we use word2vec embedding for text representation.
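A common way to turn per-word word2vec vectors into a fixed-length tweet representation is to average them, skipping out-of-vocabulary tokens. The sketch below uses a toy hand-written embedding table as a stand-in for vectors actually learned by word2vec (e.g. via gensim):

```python
# Toy 3-dimensional "embeddings" standing in for learned word2vec
# vectors -- an assumption for illustration only.
EMB = {
    "forest": [0.9, 0.1, 0.0],
    "fire":   [0.8, 0.2, 0.1],
    "hello":  [0.0, 0.9, 0.8],
}
DIM = 3

def tweet_vector(tokens):
    """Represent a tweet as the mean of its word vectors,
    skipping out-of-vocabulary tokens."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return [0.0] * DIM  # tweet with no known words
    return [sum(component) / len(vecs) for component in zip(*vecs)]

v = tweet_vector(["forest", "fire", "unknownword"])
```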
-
For each text representation setup, we apply a number of classifiers, namely logistic regression, k-nearest neighbors, decision tree, support vector machine with radial basis function kernel, random forest, stochastic gradient descent, ridge classifier, XGBoost, and AdaBoost, and compare their performance in terms of the average $F_1$-score obtained from $5$ repetitions of $6$-fold cross-validation.
-
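The evaluation scheme can be sketched with scikit-learn's repeated stratified k-fold utilities. Synthetic data stands in for the vectorized tweets, and logistic regression stands in for the full classifier roster; both are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorized tweet matrix and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 5 repetitions of 6-fold cross-validation -> 30 fitted models per classifier
cv = RepeatedStratifiedKFold(n_splits=6, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)
mean_f1 = scores.mean()  # average F1-score over all 30 folds
```

Repeating the k-fold split with different shuffles reduces the variance of the estimated score, which matters when ranking many classifiers against each other.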
The support vector machine with radial basis function kernel, acting on the word2vec-embedded data, produces the best result, achieving an average $F_1$-score of $0.783204$ over the $5$ repetitions of $6$-fold cross-validation.