This is the code repository for my thesis project, which aims to distinguish fake news articles from real news articles using an unsupervised early detection approach based on clustering techniques. A secondary aim of the thesis was to investigate whether clustering the news articles in the dataset into their respective topics, and then fitting the clustering models on each topic cluster, would improve the overall clustering performance of the models. The motivation was the belief that articles written about Donald Trump, for example, would have different linguistic styles and semantic features than articles written about healthcare.
A number of packages are required. They are listed in the requirements.txt file. To install them, open a terminal window, navigate to the folder containing the Python files and run the following command:
pip install -r requirements.txt
The dataset used in this thesis is the [FakeNewsNet](https://github.com/KaiDMML/FakeNewsNet) dataset, which can be downloaded by following the instructions in its repository. Once the data has been downloaded, which is unfortunately a time-consuming process, it must be saved to a folder named 'Dataset'. A secondary dataset must then be downloaded from https://www.dropbox.com/s/gho59cezl43sov8/FakeNewsNet-master.zip?dl=0 and saved in the same folder as the FakeNewsNet dataset. This secondary dataset is only used to train the Doc2vec model used in the experiment.
Once the datasets have been downloaded, run the whole experiment by navigating to the location of the Python files and entering the following in a terminal window:
python pipeline.py
This runs the whole experimental process in order: it extracts the necessary data from the downloaded files, preprocesses it, trains the doc2vec models, runs the topic detection experiment and finally runs the fake news detection phase.
Unfortunately, a large portion of the handcrafted features used in this project come from the LIWC lexicon, which is proprietary software. It is unclear whether it would be acceptable to upload the features LIWC produces, from which the handcrafted feature set is built, so it was decided to err on the side of caution and not upload them. The handcrafted feature set is therefore excluded from the experiment pipeline, and all code related to it is commented out in each file. Fortunately, the feature set proved to be a poor basis for learning to distinguish fake news from real news in an unsupervised setting.
Fake news poses a significant threat to the world’s democratic systems. It is an ever-growing, ever-changing problem with no clear solution in sight. Interest in fake news has grown substantially in recent years: it has hit mainstream headlines across the world, and some people believe it played a role in the election of Donald Trump in the U.S. presidential election. With public interest growing, people are looking for answers to this problem. Research into fake news today shows promising results but still faces a number of challenges. Perhaps the most significant is that current approaches rely heavily on good-quality labelled data, which is mainly obtained through fact-checking resources such as Snopes or PolitiFact. This has its drawbacks. Manual fact-checking cannot scale to the increasing volume of fake news being produced, nor to the expanding subject areas fake news is infesting. There is a clear need for unsupervised approaches that do not rely on labelled data. As unsupervised approaches are often less accurate than supervised approaches, it is believed that an unsupervised approach would act as an early detection method, the first line of defence in a fake news detection ecosystem. The features an early detection approach can rely upon are limited, however. In the early stages of a fake news article’s lifecycle, social context features such as source information or social interaction information may not be available. An unsupervised early detection approach aimed at distinguishing real news articles from fake news articles can therefore rely only on the content of the article itself, such as its text.
The approach developed in this project makes use of only the text of the article. It represents this text as a doc2vec embedding and attempts to learn latent features of the article using an autoencoder with a clustering objective function. The results of the autoencoder are compared against those of a baseline K-means model. In addition, the project investigates whether clustering news articles into topics and training an unsupervised model on each cluster improves the overall clustering performance. An article written about Donald Trump could be very different from one written about healthcare, for example. Clustering articles into topics could be considered a form of feature reduction: once the features that distinguish one topic from another are removed, models may be better able to learn the features that distinguish real news from fake news. The average performance of the models trained on the topic clusters is compared against the performance of the models trained on the whole dataset.
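The two-stage idea described above (group articles into topics, then look for a two-way split inside each topic) can be illustrated with a minimal, standard-library sketch. It makes several assumptions: K-means is used for the topic stage as well as the per-topic stage, which this README does not state; the farthest-first initialisation is only there to keep the example deterministic; the function names `kmeans` and `cluster_within_topics` are hypothetical; and the toy 2-D vectors stand in for real doc2vec embeddings. It is not the code in `pipeline.py`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns one cluster index per point."""
    # Farthest-first initialisation: an assumption made so the sketch is
    # deterministic, not necessarily what the thesis used.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cluster_within_topics(vectors, n_topics):
    """Stage 1: split vectors into topics. Stage 2: two-way K-means per topic.

    Returns one (topic, group) pair per vector. The group index is an
    arbitrary cluster id, not a fake/real label.
    """
    topic_of = kmeans(vectors, n_topics)
    result = [None] * len(vectors)
    for topic in range(n_topics):
        indices = [i for i, t in enumerate(topic_of) if t == topic]
        if len(indices) < 2:
            # Too few articles to split; put them all in group 0.
            for i in indices:
                result[i] = (topic, 0)
            continue
        groups = kmeans([vectors[i] for i in indices], 2)
        for i, group in zip(indices, groups):
            result[i] = (topic, group)
    return result


# Toy stand-ins for doc2vec embeddings: two "topics" far apart on the first
# axis, each containing two sub-groups separated on the second axis.
vectors = [[0, 0], [1, 0], [0, 10], [1, 10],
           [100, 0], [101, 0], [100, 10], [101, 10]]
for vector, (topic, group) in zip(vectors, cluster_within_topics(vectors, n_topics=2)):
    print(vector, "topic", topic, "group", group)
```

Evaluating each topic's two-way split separately and averaging the scores gives the "average performance on the topic clusters" figure that is compared against a single two-way split of the whole dataset.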
The findings of the project are that the autoencoder could not learn latent features of the article that improved its clustering performance when measured by clustering accuracy, normalised mutual information and adjusted Rand index. The baseline K-means model outperformed it. There were signs that training unsupervised models on data that has been clustered into topics slightly improves performance compared with training on the whole dataset. However, the results in both cases indicate near-random labelling. The results of this project indicate that, for unsupervised approaches, text alone is not sufficient for models to learn the features that distinguish fake news from real news. Moreover, while clustering news into topics shows signs of improving the clustering performance of unsupervised models, the improvements do not warrant further investigation into this line of research until a form of text representation is tested that better captures the features distinguishing fake news from real news.
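For readers unfamiliar with clustering accuracy, the sketch below shows one common way to compute it: cluster ids carry no meaning, so each is mapped to a class label and the mapping with the most matches is scored. This is an assumption about the definition, since the README does not say how the thesis computes it, and the function name `clustering_accuracy` is hypothetical. Brute-force search over mappings is used for simplicity; it is only practical for a handful of clusters.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of cluster ids to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best_hits = 0
    # Try every way of assigning a distinct class to each cluster.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Hypothetical labels for five articles and the cluster each one landed in.
truth = ["fake", "fake", "real", "real", "real"]
found = [1, 1, 0, 0, 1]
print(clustering_accuracy(truth, found))
```

With two classes and two clusters this score can never fall below 0.5, because the two possible mappings split the articles between them; a value close to 0.5 is what near-random labelling looks like. Normalised mutual information and the adjusted Rand index are available in scikit-learn as `sklearn.metrics.normalized_mutual_info_score` and `sklearn.metrics.adjusted_rand_score`; whether this repository uses those implementations is not stated here.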