https://sugatagh.github.io/dsml/projects/natural-language-processing-with-disaster-tweets/
https://www.kaggle.com/sugataghosh/natural-language-processing-with-disaster-tweets
-
Disaster-related tweets have the potential to alert relevant authorities early on so that they can take action to reduce damage and possibly save lives.
-
In this project, we attempt to predict whether a given tweet indicates a real disaster or not.
-
A detailed exploratory data analysis on the dataset is carried out.
-
We consider a number of text normalization processes, namely conversion to lowercase, removal of extra whitespace, removal of punctuation, removal of unicode characters (including HTML tags, emojis, and URLs starting with http), substitution of acronyms, substitution of contractions, removal of stop words, spelling correction, stemming, lemmatization, discarding of non-alphabetic words, and retention of relevant parts of speech.
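A few of these steps can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the stop-word list is a toy stand-in (a full list such as NLTK's would be used in practice), and stemming, lemmatization, acronym/contraction substitution, and spelling correction are omitted.

```python
import re

# Toy stop-word list for illustration; the real pipeline presumably
# uses a full list (e.g. NLTK's) -- an assumption here.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "and", "to"}

def normalize(tweet: str) -> list[str]:
    """Apply a subset of the normalization steps: lowercasing,
    URL/HTML/unicode removal, punctuation removal, whitespace
    collapsing, and stop-word filtering."""
    text = tweet.lower()
    text = re.sub(r"http\S+", " ", text)             # URLs starting with http
    text = re.sub(r"<[^>]+>", " ", text)             # HTML tags
    text = text.encode("ascii", "ignore").decode()   # emojis / unicode
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation and digits
    tokens = text.split()                            # also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

normalize("Forest fire near La Ronge, Sask. <b>Canada</b> http://t.co/x")
# -> ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
```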
-
We implement bag of words text representation and extend the analysis to bag of bigrams as well as a mixture representation incorporating both words and bigrams.
-
Next, we implement TF-IDF text representation. Similar to the previous setup, we carry out unigram, bigram, and mixture analysis.
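The weighting behind TF-IDF can be computed by hand on a toy corpus. The smoothed IDF formula below, log((1+N)/(1+df)) + 1, is the scikit-learn default and is an assumption about the exact variant used:

```python
import math
from collections import Counter

docs = [["forest", "fire", "near", "forest"],
        ["fire", "alarm", "test"],
        ["storm", "near", "coast"]]

N = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term frequency times smoothed inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * (math.log((1 + N) / (1 + df[t])) + 1) for t in tf}

weights = tfidf(docs[0])
# "forest" occurs twice and only in one document, so it outweighs
# "near" and "fire", which each occur once and appear in two documents.
```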
-
Finally, we use word2vec embedding for text representation.
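A common way to turn per-word word2vec vectors into a fixed-length tweet representation is to average them, skipping out-of-vocabulary tokens. The sketch below uses a toy hand-written embedding table as a stand-in for vectors actually learned by word2vec (e.g. via gensim):

```python
# Toy 3-dimensional "embeddings" standing in for learned word2vec
# vectors -- an assumption for illustration only.
EMB = {
    "forest": [0.9, 0.1, 0.0],
    "fire":   [0.8, 0.2, 0.1],
    "hello":  [0.0, 0.9, 0.8],
}
DIM = 3

def tweet_vector(tokens):
    """Represent a tweet as the mean of its word vectors,
    skipping out-of-vocabulary tokens."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return [0.0] * DIM  # tweet with no known words
    return [sum(component) / len(vecs) for component in zip(*vecs)]

v = tweet_vector(["forest", "fire", "unknownword"])
```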
-
For each text representation setup, we apply a number of classifiers, namely logistic regression, k-nearest neighbors, decision tree, support vector machine with radial basis function kernel, random forest, stochastic gradient descent, ridge classifier, XGBoost, and AdaBoost, and compare their performance in terms of the average $F_1$-score obtained from $5$ repetitions of $6$-fold cross-validation.
-
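The evaluation scheme can be sketched with scikit-learn's repeated stratified k-fold utilities. Synthetic data stands in for the vectorized tweets, and logistic regression stands in for the full classifier roster; both are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorized tweet matrix and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 5 repetitions of 6-fold cross-validation -> 30 fitted models per classifier
cv = RepeatedStratifiedKFold(n_splits=6, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)
mean_f1 = scores.mean()  # average F1-score over all 30 folds
```

Repeating the k-fold split with different shuffles reduces the variance of the estimated score, which matters when ranking many classifiers against each other.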
The support vector machine with radial basis function kernel, acting on the word2vec-embedded data, produces the best result, achieving an average $F_1$-score of $0.783204$ over the $5$ repetitions of $6$-fold cross-validation.