This Python neural-network program was part of a group project in which I was in charge of training and testing a sentiment classifier based on an LSTM model.
The subject of bias in the context of NLP datasets, as well as the algorithms trained on these datasets, has been of increasing concern as the use of these algorithms becomes more widespread. In response to these concerns, methods to identify and mitigate bias, both in datasets and in the word embeddings derived from these datasets, have been developed. In our work, we examine one such method of bias-mitigation for word embeddings, as presented by Manzini et al., and attempt to assess the performance of the debiased embeddings in comparison with non-debiased embeddings in the task of sentiment analysis. We find that, under most circumstances, the debiased embeddings lead to decreased performance compared to the non-debiased embeddings.
The word2vec word embeddings are pre-trained on the Reddit L2 corpus (Rabinovich et al., 2018), which contains approximately 56 million sentences of posts and comments from users. This data is limited to users in the USA to leverage the extensive studies of social bias conducted there. Sentiment140, a corpus of 1.6 million tweets pre-labeled with positive or negative sentiment, is used in combination with the word2vec embeddings to train our sentiment classifier; testing also uses only the Sentiment140 corpus. We implemented a holdout validation method, in which we trained our classifier on a random selection of 80% of the selected training data and used the remaining 20% to test the performance of the algorithm after training.
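The 80/20 split described above can be sketched as follows; the function name and the use of NumPy for shuffling are our own illustration, not taken from the project code.

```python
import numpy as np

def train_test_split_80_20(tweets, labels, seed=0):
    """Randomly split the data into 80% training and 20% test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(tweets))   # shuffle indices reproducibly
    cut = int(0.8 * len(tweets))         # 80/20 boundary
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([tweets[i] for i in train_idx], [labels[i] for i in train_idx],
            [tweets[i] for i in test_idx], [labels[i] for i in test_idx])
```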
The word2vec word embeddings we used were taken directly from the links provided by Manzini et al. Several sets of embeddings can be found there, all based on data taken from Reddit.com, including a baseline (non-debiased) set and a series of embeddings produced by applying hard debiasing to the baseline along racial, gender, and religious dimensions. We use these embeddings together with a dataset of sentiment-labeled tweets to train our classifier. This is a binary classification task, so there are two possible sentiment labels: positive or negative. We preprocess the tweets to remove links and unrecognized terms such as user tags and emoticons.
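A minimal preprocessing sketch, assuming simple regular expressions suffice for links, user tags, and common emoticons; the exact patterns used in the project may differ.

```python
import re

def preprocess_tweet(text):
    """Strip links, user tags, and simple emoticons before embedding lookup."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # remove links
    text = re.sub(r"@\w+", " ", text)                      # remove user tags
    text = re.sub(r"[:;=8][-~]?[)(DPpOo/\\|]", " ", text)  # remove simple emoticons
    return re.sub(r"\s+", " ", text).strip().lower()       # normalize whitespace/case
```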
The LSTM, a variant of the RNN, is trained with a batch size of 64 for up to 600 epochs. A batch size of 64 is standard among other researchers: a smaller batch size can slow training, because the model must make more weight updates per epoch, while a larger batch size could have sped up training but, given hardware limitations and the minimal increase in performance, we settled on 64. After training, we evaluate the model's ability to predict the sentiment of tweets using four performance metrics: F-score, precision, recall, and accuracy. We then compare these statistics to evaluate the change in performance as we train the algorithm on the three debiased word embeddings, one for each social bias. We allowed early stopping with a patience value of ten epochs to reduce the possibility of overfitting.
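The four performance metrics can be computed from the binary predictions as follows; this is a generic sketch rather than the project's evaluation code, treating label 1 as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F-score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}
```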
Conventional RNNs suffer from the vanishing gradient problem, in which the gradient becomes so small that the weights of the nodes in the neural network stop changing. An LSTM can retain information for longer than a conventional RNN, which allows earlier data to continue to influence the weights of the nodes. However, these extra capabilities come at the cost of extra computational resources; as a result, in order to train our algorithm, we had to reduce our training data from its original size of 1.6 million tweets. To reduce the loss, we use the Adam optimization algorithm during training. We used the Keras and TensorFlow libraries and frameworks to build the neural network, and pandas and NumPy to work with the datasets and embedding files.
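A minimal Keras sketch of the setup described above. The vocabulary size, embedding dimension, and the 128 LSTM units are illustrative assumptions not stated in the text; the batch size of 64, the 600-epoch limit, the Adam optimizer, and early stopping with a patience of ten epochs come from the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical dimensions: vocabulary size and word2vec vector length.
VOCAB_SIZE, EMBED_DIM = 20000, 300
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))  # filled from word2vec in practice

model = keras.Sequential([
    layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False),                   # keep the pre-trained embeddings fixed
    layers.LSTM(128),                       # retains long-range context across the tweet
    layers.Dense(1, activation="sigmoid"),  # binary positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping with a patience of ten epochs, as described above.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# model.fit(x_train, y_train, batch_size=64, epochs=600,
#           validation_split=0.2, callbacks=[early_stop])
```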
Generally speaking, there was a small but consistent decrease in performance when switching from non-debiased embeddings to debiased embeddings; given the large corpus we used, even a small change is significant. We expected this decrease based on our review of previous work, where debiasing reduced sentiment-classification F-score from 84% to 83% (Hube, 2020), and where Manzini et al. note that their downstream tasks' "performance changes are of questionable statistical significance" (Manzini, 2019). This suggests that, while the debiasing procedures attempt to preserve the other information contained in the word embeddings, some of this information is inevitably lost during the debiasing process. Compared to the performance of the non-debiased embeddings, the gender-debiased embeddings showed changes of -0.039 in accuracy, -0.043 in F-score, -0.026 in precision, and -0.05 in recall: approximately a 2 to 5 percent decrease across the performance metrics.
Similar decreases were observed with the other sets of debiased embeddings. The racially debiased embeddings performed slightly better in recall, by 0.002, but on the remaining metrics they still performed worse than the non-debiased set. We argue these results are significant given that gender debiasing is known to preserve the utility of downstream tasks (Bolukbasi et al., 2016); the goal of debiasing is to preserve the utility of the word embeddings, yet we show that this utility is decreased. As we predicted in related work, our decrease in accuracy was greater than that reported by Hube, who tested the impact of debiasing on sentiment classification (Hube, 2017).
To test the quality of our system and ensure it performs better than random chance, we created a baseline system that simply picks a random number from the set {0, 1} and assigns it to each tweet as its sentiment label. This baseline achieved an accuracy of 0.499, much lower than our sentiment classifier's accuracy of 0.76. Whether we used debiased or non-debiased word embeddings, our system substantially outperformed the baseline, indicating that both variants perform better than random chance; thus, we conclude that our classifier can be used to predict the sentiment of tweets.
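The random baseline can be sketched as follows; the function name and seeding are illustrative. With a large enough sample, the accuracy converges toward 0.5, consistent with the 0.499 reported above.

```python
import random

def random_baseline_accuracy(labels, seed=0):
    """Assign each tweet a uniformly random label from {0, 1} and
    report accuracy against the true labels."""
    rng = random.Random(seed)
    correct = sum(1 for y in labels if rng.choice((0, 1)) == y)
    return correct / len(labels)
```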